EDITORIAL NOTE


The present issue of Psychometrika is devoted in its entirety to the publication of 
papers presented by invitation at the Twenty-fifth Anniversary Meeting of the Psycho- 
metric Society, Chicago, Illinois, September 6-7, 1960. The program was arranged by a 
special committee of the Psychometric Society, Paul Dressel, John Milholland, and Charles 
Wrigley (Chairman). 

In some cases authors have modified their papers before submitting them for publi- 
cation. Titles have been particularly subject to change. In all cases, editorial changes 
have been kept to a minimum. 

One paper, presented at the Chicago meetings, does not appear here. Leon Festinger 
preferred that his remarks, “Cautions and comments,” not be included in this published 
record, having prepared his presentation as informal discussion rather than as a more 
formal article. 

Authors are to be congratulated, not only for the quality of their contributions, 
but also for their cooperation in having provided good copy sufficiently in advance of 
publication to allow this special issue of Psychometrika to appear on schedule. 


ERRATUM 


The following correction should be made in the paper, Stone, M., Models 
for choice-reaction time, Psychometrika, 1960, 25, 251-260. The expressions 
on p. 258 for vo and v, should be increased by a(1 — a)(#, — fip)?/(1 — a — 8)? 
and B(1 — 8)(”%, — %)’/(1 — a — 8)’, respectively. The right-hand sides 
of equations (3) and (4) on p. 254 should be increased by 


[J(a, B)B(1 — 8) — J(B, a)a(1 — @))(%, — %)’/(1 — a — 8)’ 
and 


[J(q, B)B(L — 8B) — J(B, a)a(l — a)|(T., — T..)’/(L — a — 8)’, 


respectively. 
Irving Lorge


With the death on January 23rd of Irving Lorge the Psychometric 
Society has lost another of its original founders and past presidents, and 
its members have lost an esteemed colleague and for many of them a valued 
friend. Professor Lorge, who was 55 years old, died suddenly and entirely 
unexpectedly from a heart attack. He is survived by his wife, Sarah, and 
two daughters, Paula Lee and Beatrice Susan. 

It was in 1927 that Irving Lorge came to Teachers College to work 
with my father, Edward L. Thorndike, as a research assistant in the Institute 
of Psychological Research. Except for a period of government service during
World War II, he stayed there the rest of his life. He moved up rapidly from 
research assistant to valued associate and collaborator in many of my father’s 
enterprises, and went on to become Executive Officer of that same Institute 
of Psychological Research and Professor in the Department of Psychological 
Foundations and Services. 

The very diversity of Dr. Lorge’s interests and activities makes it 
difficult to type or categorize him. Basically he was interested in research— 
almost anybody’s research on almost anything. Whatever problem was 
brought to him became for the time, and sometimes for quite a long time, 
his problem. He was excited about it, and brought to bear upon it a rich 
background of knowledge and skill. It is perhaps for this reason that he was 
so widely in demand as a research advisor—both by students at Teachers 
College and by mature investigators throughout the country. Much of his 
professional energy was devoted to helping others with their research prob- 
lems, both in the design and planning of their studies and in the analysis 
of their data. 

In his early days especially, Dr. Lorge’s outspoken remarks occasionally 
infuriated certain of his more venerable and less research-oriented colleagues. 
On the surface sometimes brusque and even seemingly harsh, he evoked 
fairly acute anxiety in his initial contact with the more timorous students. 
But basically he was the kindest and most supporting of men. Thus, though 
he could be devastating in his criticism of what he considered to be slip- 
shod thinking in a student’s research, he was nevertheless extremely generous 
of his time and counsel in carrying that same student through to a satisfactory 
outcome. Those with whom he worked were devoted to him. 

His own work covered many fields, and it is hard to know which he 
valued most highly. One major interest was certainly language and communi- 
cation. He expanded the work on word counts into a count of specific mean- 
ings, providing much of the initiative for the Semantic Count of English 
Words (published with E. L. Thorndike). He became an expert on read- 
ability, contributing to the techniques for appraising reading materials and 
educating many in high places on the importance of writing so that others
could read. 

An early exposure to research on the psychology of the adult led to a 
continuing interest in the problems of aging. He was convinced that early 
results showing the decline of ability gave a badly distorted picture, and 
presented evidence to show how the age curve was related to amount of 
pressure for speeded performance, on the one hand, and to continuing educa- 
tion, on the other. He continued to contribute both research studies and 
professional leadership to the field of aging up to the time of his death. 

Throughout his life, he always maintained an interest in psychometric 
techniques and psychometric theory. Much of his early teaching at Teachers 
College was in statistics, and over the years his teaching-and-service IBM 
installation at the College provided an introduction to machine data processing 
for hundreds of students. He helped to develop testing materials for the Air 
Force and the Army Specialized Training Program during World War II. 
For many years he carried on a series of testing services for units within 
Columbia University and in the New York City School System. I was fortu- 
nate to be associated with him as a junior author in his major published 
test series, the Lorge-Thorndike Intelligence Tests. 

But these are only a few of his interests, and some might count others 
as more central. He was deeply interested in the problems of the gifted, 
and in research about them and educational provision for them. He was for 
a time something of a rural sociologist, investigating small rural communities 
during the depression of the 1930’s. He has an extensive early bibliography 
of contributions to the psychology of learning. For a period after World 
War II he became a social psychologist, directing a series of studies on group 
problem solving and decision making. There are, I am sure, other types of 
activities of which I am not even aware. 

We who worked most closely with Irving Lorge find it hard to realize 
that this vital personality will no longer be with us. The Psychometric 
Society, too, is the poorer for his passing. 


Columbia University Teachers College Robert L. Thorndike 
FECHNER: INADVERTENT FOUNDER OF PSYCHOPHYSICS


Edwin G. Boring
HARVARD UNIVERSITY 


Certainly the most interesting tool that science employs in its perpetual
pursuit of knowledge is the scientist, with his enthusiasms, egoisms, and 
prejudices, his inevitable unconscious attitudinal orientation in the consensus 
of contemporary opinion, which we call the Zeitgeist. One wonders what 
science would be like if automation could take over completely. Are there 
mechanical equivalents for jealousy and pride and pigheadedness and insight, 
and those other interacting personal forces that contribute to contemporary 
truth in the scientific field? 

Fechner never tried to found psychophysics or a new experimental 
psychology. He was, in his own estimation in those last forty-five years of 
his life, a philosopher, fighting what he regarded as the crass materialism of 
his day, the Nachtansicht or “night view,” as he called it, and promoting the 
faith that mind and soul are the ultimates of reality, the Tagesansicht or 
“day view.” This favoring of the clear philosophical vision in the day view 
as opposed to the materialistic darkness of the night view is Fechner’s 
panpsychism, a faith that seems mystical to most modern scientists, partly 
because the German word Seele does not distinguish between mind and 
soul, between that which compares the sensory intensities of two lifted 
weights and whatever it is that persists after the body’s death. 

Let us take time to recall what Fechner did with the 86 years of his 
life between 1801 and 1887. At the age of 16 he went to Leipzig to study 
physiology, which in those days meant taking a doctorate in medicine. He 
stuck to physiology for only seven years and then turned to the study of 
physics and mathematics. He began work in this new field humbly, making 
his early reputation by the translation into German of French handbooks 
of physics and chemistry. At the age of 33, after some research in the new 
physics of electricity, he was made professor of physics at Leipzig; he held 
that post until 1839, when he resigned for reasons of poor health. 

For 15 years he had been a physicist, but three other interests were 
emerging. Under the nom de plume of Dr. Mises, he provided scope for 
his humanistic interests by beginning a series of essays on various topics, 
the first of which was a satire on the current medical faith in the potency 
of iodine: Proof that the Moon is Made of Iodine (1821). Out of this side of 
Fechner’s nature emerged his vigorous support of spiritualism as opposed 
to materialism: he wrote The Little Book on Life after Death in 1836. On
the scientific side there was his growing interest in sense-physiology, and 
presently his papers on subjective colors and afterimages in 1838 and 1840. 
It must have been then that he permanently injured his eyesight by gazing 
too long at the sun through colored glasses. 

There followed from 1839 to 1851 a dozen years of retirement in Leipzig. 
During the first three or four years he suffered from some form of psycho- 
neurosis, and it would seem that this German academic never quite escaped 
from unusual seclusiveness as he lived on in Leipzig outside of the University. 
It was during this period that his concern with the “day view” of reality,
with panpsychism, emerged. In 1848 he published Nanna, a volume named 
for the goddess of flowers, in which he argued for the mental life of plants. 
Then in 1851 came the Zend-Avesta, with a subtitle specifying that the 
volume was about the things of heaven and the life to come. 

Actually this philosophical period of Fechner’s life extended altogether 
over 43 years from 1836 to 1879, during which, in writing in 1861 on the 
problem of the soul, he remarked that he had already called four times to 
a sleeping world which had not awakened, and he was now calling a fifth 
time, and “if I live, I shall call yet a sixth and a seventh time, ‘Steh’ auf!’
and always it will be the same ‘Steh’ auf!’” He did call twice more, the seventh
in 1879 in the volume on the “day view and the night view.”

Fechner’s philosophy won him little respect among the scientists, nor 
any great acclaim by the philosophers. William James took him seriously, 
hailed the Zend-Avesta when he belatedly discovered it, told Bergson that 
Fechner “seems to me of the real race of prophets.” James described Fechner’s
philosophy in A Pluralistic Universe [4] and related Fechner’s views to his 
own. It was this excitement about spiritualism that pushed Fechner into 
psychophysics—strange parentage it was for psychophysics. 

On that now famous morning of 22 October 1850, Fechner, lying in 
bed and puzzling how to do away with materialism, had the thought that, 
since conscious events are necessarily related to events in the brain—at 
least in the living person—an equation between the two systems would have 
the effect of identifying them and of abolishing the dualism, abolishing it 
in favor of a psychic monism which was what Fechner wanted. If he knew 
about Weber’s law, he did not think about its relevance then. Later, however, 
he realized the significance of Weber’s experiments and also of Daniel 
Bernoulli’s contention in 1738 that fortune morale (psychic) is proportional 
to the logarithm of fortune physique (physical). Now Fechner thought: 
sensation is a function of its stimulus; you can measure stimuli, but how can 
you measure sensations? He concluded that sensory magnitude can be 
measured in terms of sensitivity, and he laid down the general outlines 
of his program in Zend-Avesta, the book about heaven and the future life. 
Imagine sending a graduate student of psychology nowadays to the Divinity 
School for a course in immortality as preparation for advanced experimental
work in psychophysics! How narrow we have become! 

After the publication of the Zend-Avesta Fechner had 14 years of intense 
activity in psychophysics, the first 9 of them in experimentation. After that 
came the epochal event, the publication of the Elemente der Psychophysik
in two parts in 1860, the occasion that we celebrate today. It was the psycho- 
physics, not the panpsychism, that attracted attention. Fechner’s alleged 
measurement of sensation met with criticism and objection which indeed 
showed its importance in the current scientific belief that belonged to the 
mid-nineteenth century. History was now ready for a scientific psychology, 
but how can you become scientific unless you can measure your phenomena? 
Fechner’s scheme was plausible and the need for sensory measurement led 
some to overlook its defects. He argued that sensation cannot be measured 
directly but can be indirectly. What you do is to measure sensitivity by 
determining differential thresholds; then, to find the magnitude of the sensa- 
tion, you calculate the number of just noticeable differences (jnd) from zero 
sensation at the absolute threshold to the sensation that is being measured. 
Of course, this business of counting up jnd to measure a sensation met with 
the question: How do you know that all jnd are equal? And indeed, when 
measured by certain other scales, jnd may turn out not to be equal. 

About 1865 Fechner turned from psychophysics to a new interest in 
experimental esthetics, publishing his classic in that field in 1876. The world, 
however, would not leave him free. Applause from some reinforced criticism 
from others, and Fechner was forced—for it was not easy for a German 
scholar to let criticism go unanswered—to reply to objections and to defend 
his measurement of sensation. He must have thought that he would himself 
have been content to go on crying to a sleeping world that the measurement 
of sensation had now made plausible man’s grasp on immortality; but when 
the world at last awoke, it was to the wrong cry—unfortunately for Fechner, 
fortunately for us. 

Tolstoy, speaking of History in his War and Peace and arguing for
cultural determination (and thus indirectly against the importance of Great
Men in the determination of History), remarked that “History, the
unconscious, general hive-life of mankind, uses every moment of the life of
kings as a tool for its own purposes ... A king is History’s slave.” History
itself is the sum of the myriad of events that make it up, and every one of 
these is caused, though there be so many that prediction from a knowledge 
of them becomes impossible. As to the Great, Tolstoy imagined a young 
cavalry commander who achieved high honor because, exuberant with good 
health, unaware of danger but without orders, he led his men at a gallop 
across the level plain in what turned out to be a successful charge. So with 
Fechner. He attacked the ramparts of materialism and was decorated for 
measuring sensation. 
Scientists, for the most part, believe in the operation of deterministic
causality between events, yet they also like in ordinary professional con- 
versation to leave room for the originality of Great Men. There is a contra- 
diction here. To see the Great Man’s important contribution to thought, 
as a consequence of the combination of commonly accepted knowledge, plus 
certain ideas or discoveries of other men, plus one or two coincidences of the 
kind of insight that brings hitherto unrelated ideas into useful connection,
is largely to reduce greatness to a link in a complex causal chain. When the 
whole story is told of an invention or a discovery or the founding of a school, 
when as much attention is given to the antecedents as to the consequences 
of the great event, its greatness seems to diminish, its importance becomes
less as it spreads over a broader range of activities and a longer span of time. 

The case with Fechner goes about like this. The times were ready for 
scientists to get hold of mind by measuring it. Sensory thresholds had been 
determined as much as a hundred years before Fechner. The physiologists 
were already experimenting with sensation: Johannes Müller with specific
nerve energies in 1826, Ernst Heinrich Weber with tactual sensibility in 1834. 
To contemporaneous thought Herbart had contributed the notion of the 
measurement of ideas, while denying the possibility of experimenting on 
them; and he had made Leibnitz’s concept of the threshold well known. 
Lotze published his Medical Psychology: The Physiology of the Mind the year 
after Fechner’s Zend-Avesta. It was in this setting that Fechner had on 
22 October 1850 his important insight about measuring sensation and relating 
the measures of sensation to the measures of their stimuli. 

Fechner’s claim to originality of epoch-making magnitude lies in this 
insight. His claim to honor lies in his careful and laborious work through the 
decade of the 1850’s, and the crucial character of the Elemente when it finally 
came out in 1860. He is credited with having given experimental psychology 
the three fundamental psychophysical methods still in constant use today, 
but actually the method of limits goes back to 1700 and may be said to have 
been formalized by Delezenne in 1827, whereas the method of constant 
stimuli was first used by Vierordt in 1852. Only the method of average 
error belongs to Fechner, and that only half, for he and his brother-in-law, 
A. W. Volkmann, developed it in the 1850's. What Fechner did in the Elemente 
was to present the case for sensory measurement and write the systematic 
handbook for psychophysics, a new field of scientific endeavor. In this sense 
he founded psychophysics as a field that is ancillary to the establishment 
of the philosophy of panpsychism. 

It is conceivable that the Elemente might have fallen flat, as the laborious 
production of a queer old mystic in Leipzig who went to endless pains to 
prove a point that most wise men do not believe. The times, however, were 
ripe for psychophysics. Immediately the methods began to be used, and 
new facts began to accumulate, while the argument waxed about Fechner’s 
interpretation of what it is that the methods do, about whether sensation
had actually been measured after all. 

In general, the greatness of Great Men is a subjective addition to history 
which posterity adds in order to understand history. History is continuous 
and sleek. Great Men are the handles that you put on its smooth sides. You 
have to simplify natural events in order to understand them, and science 
itself is forced to generalize in the interest of economy of thinking. Just so 
the history of science singles out events, schools, trends, and discoveries and 
eponymizes them, that is to say, it names them for a central figure. Fechner 
has become the name for a change in the newly developing scientific psy- 
chology, for the gradual acceptance of the belief that the fleeting and eva- 
nescent mind—consciousness—can be measured. That had to happen before 
anything else could take place in respect of scales and measurement in the 
psychological sphere. 

William James admired Fechner, the philosopher, but deplored Fechner, 
the psychophysicist. Almost everyone knows how he said, “But it would 
be terrible if even such a dear old man as this could saddle our science forever 
with his patient whimsies, and, in a world so full of more nutritious objects 
of attention, compel all future students to plough through the difficulties, 
not only of his own works, but of the still drier ones written in his refuta- 
tion ... The only amusing part of it is that Fechner’s critics should always 
feel bound, after smiting his theories hip and thigh and leaving not a stick 
of them standing, to wind up by saying that nevertheless to him belongs the 
imperishable glory, of first forming them and thereby turning psychology into 
an exact science.” Well, say I, isn’t that sort of glory as nearly imperishable
as one could expect ever to get? But then, of course, James did not agree 
with Tolstoy. He thought that there are Great Men. 

Only this year Henri Piéron has expressed a thought quite similar to 
James’ except that Piéron and James are on opposite sides of the Fechner 
fence. Piéron wrote in concluding a centennial article about the importance 
of Fechner’s psychophysics: “And thus the shade of Fechner does not cease
in our day to hover over many American laboratories of experimental
psychology which without doubt never hear tell of Fechner except when Stevens
declares that nothing of Fechner’s work remains.” That is hardly fair to 
us Americans. Stevens’ students hear about Fechner, and scattered over 
America are a small coterie of psychologists who seldom miss noting the 
date when 22 October comes around. 

And now here are we celebrating the centenary of the Elemente. In 
complimenting Fechner we compliment ourselves, of course. A centenary is 
virtually a religious rite. We could not be pleasing Fechner now, even if he 
had justified his contribution to psychophysics by eventually finding himself
immortal. What we need for our own use are symbols of our faith, our faith 
in science and measurement and quantification. It is right to hang Fechner’s 
picture on the wall. It is a symbol of what we will to have important. It is
right to be glad when your son is born on 22 October. It is right to atomize 
the smooth flow of History by the eponymy of great names. The scientist 
may be a determinist in his model-making, but as an active scholar and 
experimenter he needs more motivation than simple description and the 
generalization of observation can provide. He needs humor and reverence, 
as well as a little distortion of the complacency of history, to keep his prime- 
mover going, and what good is the scientific machine without a prime-mover? 

It was given to Fechner to have the idea of measuring sensation in- 
dependently of the measure of its material stimulus. In his own opinion he 
succeeded. Posterity doubts the validity of his procedure or even condemns 
it. Yet, if posterity has something better, it grew out of what Fechner pro- 
vided. All honor then to the man who, resolved to achieve one goal, actually 
reached another, who because of his patient insistence remains the central 
figure at the absolute threshold at which measurement entered psychology. 
It may be said that he gave to sensations their magnitudes. 
LINEAR AND MULTIDIMENSIONAL SCALING*


HAROLD GULLIKSEN
PRINCETON UNIVERSITY 
AND 
EDUCATIONAL TESTING SERVICE 


Fechner’s work [12] was motivated by an interest in the mind-body 
problem, concentrating particularly upon the question, “What is the
relationship between the measurable characteristics of the physical stimulation and
the subjective characteristics of the sensations produced?” Some of Fechner’s
work was devoted to experiments in areas where the physical measurements 
that can be made on the object are not functionally related to the psycho- 
logical characteristics. 


Paired Comparisons and the Law of Comparative Judgment 


In his work on aesthetics, Fechner [13] had his subjects choose “the
best one” in the group—the best picture, the best proportioned rectangle, 
etc. Recognizing that the particular combination and permutation used 
could influence the choice, he regarded a group of two as a refinement, as 
the easiest to control. Paired comparisons was not regarded as a method of 
scaling the set of stimuli compared. 

Titchener discusses paired comparisons in his “qualitative” manual [43]. 
Thorndike in his monograph on handwriting [40] used paired comparisons of 
selected pairs (pairs of stimuli adjacent along the scale) in order to set up 
a handwriting scale. He used the principle that equally often noticed differ- 
ences are equal, and also used the normal curve transformation so that 
various percentages of judgments “i is better than j” could be converted into
distances on the handwriting scale. He did not use the entire set of paired 
comparisons as a check on the consistency of the judgments. Thorndike’s 
scaling methods were simply rules for getting the scale values. 

Thurstone [41] formulated the law of comparative judgment and showed 
that the entire set of ½n(n − 1) paired comparisons could be used to check
on the agreement of the data with the law. He was, as far as I am aware, 
the first to formulate a law, the law of comparative judgment, to be used 
with the method of paired comparisons as a scaling technique. 


*Prepared as a technical report in connection with research partially supported by
Office of Naval Research contract Nonr 1858-(15) and National Science Foundation
Grant G-3407 to Princeton University, and by the Educational Testing Service.
Reproduction of any part of this material is permitted for any purpose of the
United States Government.
Thurstone also questioned the constancy of the “just noticeable
difference” (jnd), or the “equally often noticed differences,” as did Newman [34],
Stevens, and others. Thurstone, however, solved the problem. He showed 
that the concept of the discriminal dispersion provided the answer. If dis- 
criminal dispersions are constant for sensations produced by stimuli of various 
magnitudes, then pairs of stimuli separated by a jnd give rise to sensations 
that are psychologically equidistant. If discriminal dispersions are unequal, 
then pairs of stimuli separated by a jnd give rise to pairs of sensations that 
are not psychologically equidistant. Detailed explanations of the conditions 
under which equally often noticed differences would be psychologically equal
were given by Thurstone in three articles published in 1927 (see [42], ch.
2, 3, and 5). It is interesting to note that even with the clear exposition given
by Thurstone in 1927, the problem of lack of constancy of the jnd is still 
a topic for discussion. 

Thurstone saw that in order to get a grip on problems where only a
subjective scale was involved, one could not deal with the logarithmic law
as stated by Fechner, or a power law as proposed by Plateau and by Stevens
[39]. These laws both involve physical measurement as well as psychological
measurement. Thurstone found that one could state a law interrelating the
various percentages of judgments i greater than j, and obtain from such
percentages a test of internal consistency of the data and scale values for
the stimuli.

The law of comparative judgment utilizes a normality assumption, a 
linearity assumption, and a difference assumption. The difference assumption
means that the subjects react to the difference in the scale values of the 
stimuli. The linearity assumption means that one is dealing with a set of 
objects such that the distance from A to B plus the distance from B to C 
will equal the distance from A to C. The normality assumption specifies that 
the subjective scale values for a given stimulus have a normal distribution. 

Subsequent investigators (Luce [26], Mosteller [31, 32, 33], Bradley and 
Terry [3]) have shown that substitutions may be made for any of these 
postulates, thereby obtaining other sorts of laws. In place of the normal 
distribution, other investigators have substituted the arc sine distribution
or the logistic. In place of the difference assumption, a ratio assumption has 
been used, and the subject is said to be responding to a ratio of the scale 
values instead of to a difference. Correspondingly, instead of the linearity 
assumption, one might assume that the A/B ratio multiplied by the B/C 
ratio will equal the A/C ratio. 

It is important to provide a method for checking on the validity of the 
set of assumptions. Thurstone proposed to show that all of these paired 
comparison judgments involved in n stimuli, that is ½n(n − 1) judgments,
could be predicted from only n different scale values of the stimuli, when 
these scale values were computed according to the equations which followed 
from the law of comparative judgment. He undertook several experiments
and found that the experimental proportions of judgments i greater than j
were predicted reasonably accurately from such a set of scale values. 
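
The check described above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not Thurstone's own
computation: it assumes equal discriminal dispersions (the simplest case),
takes the standard deviation of a difference as the unit, uses the row-mean
solution for the scale values, and runs on hypothetical error-free data. The
function names are invented for the example.

```python
from statistics import NormalDist

_N = NormalDist()  # unit normal curve


def case_v_scale(p):
    """Scale values from p[i][j], the proportion of judgments "i greater than j".

    Assumes equal discriminal dispersions and no proportion equal to 0 or 1.
    """
    n = len(p)
    # Convert each off-diagonal proportion to a unit-normal deviate.
    z = [[0.0 if i == j else _N.inv_cdf(p[i][j]) for j in range(n)]
         for i in range(n)]
    # Row means of the deviates serve as scale values (origin arbitrary).
    return [sum(row) / n for row in z]


def predicted_proportions(s):
    """Proportions implied by the n scale values s under the same assumptions."""
    n = len(s)
    return [[_N.cdf(s[i] - s[j]) for j in range(n)] for i in range(n)]


def max_discrepancy(p, s):
    """Largest gap between observed and predicted off-diagonal proportions."""
    q = predicted_proportions(s)
    n = len(p)
    return max(abs(p[i][j] - q[i][j])
               for i in range(n) for j in range(n) if i != j)


# Hypothetical error-free proportions generated from known scale values.
true_s = [0.0, 0.5, 1.2, 2.0]
p = predicted_proportions(true_s)
s = case_v_scale(p)
gap = max_discrepancy(p, s)
```

With real data the gap would not vanish; its size, judged against sampling
error, is what indicates whether n scale values account for the full set of
proportions.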

Similar procedures have been followed by subsequent investigators, such 
as Bradley and Terry [3] and Luce [26]. Some interesting relationships among 
the various theories have come to light. The ratio assumption and the
difference assumption turn out to correspond simply to different transformations
of the scale, the difference assumption scale values being essentially equivalent 
to the logarithm of the ratio assumption scale values. With respect to the 
logistic, the arc sine, and the normal curve assumptions, these are, from a 
mathematical point of view, distinct and unrelated curves. If any one of 
them holds, the other two distinctly do not hold. It would of course be
possible for all three to be erroneous. However, from the viewpoint of actual
application to data, the results predicted from these various curves are
very similar. Except for unusually extensive experiments involving several
thousand judgments, it is not possible to distinguish between these curves
in terms of fit to data. 
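
Both points in this paragraph can be illustrated numerically. The Python
sketch below is an illustration only. It writes the ratio assumption in the
form a_i / (a_i + a_j), shows that the logarithms of those scale values
behave as difference-assumption scale values on a logistic curve, and then
compares the three curves over a grid. The rescaling constants (1.702 for
the logistic, and a slope-matched constant for the arc sine curve) are
assumptions of the example, not values given in the text.

```python
import math
from statistics import NormalDist

_N = NormalDist()


def p_ratio(a_i, a_j):
    """Ratio assumption: the choice depends on the ratio of scale values."""
    return a_i / (a_i + a_j)


def p_logistic_difference(s_i, s_j):
    """Difference assumption with a logistic curve."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))


def p_normal(d):
    return _N.cdf(d)


def p_logistic(d, k=1.702):
    # k is a conventional rescaling constant (an assumption of this sketch).
    return 1.0 / (1.0 + math.exp(-k * d))


def p_arc_sine(d, c=math.sqrt(2.0 / math.pi)):
    # c matches the slope of the normal curve at d = 0 (an assumption);
    # the curve is held at 0 or 1 outside its finite range.
    x = max(-math.pi / 2, min(math.pi / 2, c * d))
    return 0.5 * (1.0 + math.sin(x))


# Hypothetical ratio-scale values and their logarithms.
a = [1.0, 2.5, 7.0]
s = [math.log(v) for v in a]
ratio_vs_difference = max(
    abs(p_ratio(a[i], a[j]) - p_logistic_difference(s[i], s[j]))
    for i in range(3) for j in range(3))

# Largest gaps between the curves over scale differences from -3 to 3.
grid = [x / 100.0 for x in range(-300, 301)]
max_gap_logistic = max(abs(p_normal(d) - p_logistic(d)) for d in grid)
max_gap_arc_sine = max(abs(p_normal(d) - p_arc_sine(d)) for d in grid)
```

For comparison, the standard error of an observed proportion near 0.5 based
on N judgments is about 0.5/√N, or roughly 0.016 for N = 1000, which suggests
why gaps of this size are hard to detect in experiments of ordinary length.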

The different curves do, however, have different properties with respect 
to generalizing beyond the linear paired comparison situation. For purposes
of analysis of variance tests, the arc sine is a very convenient function.
The logistic generalizes readily to comparisons involving more than two
objects, and also lends itself readily to the development of maximum
likelihood tests for fit. However, if one deals with the multidimensional situation,
where the objects being compared are characterized not by a single number 
but by several numbers, then the logistic is a more laborious type of
formulation. The normal curve generalizes quite smoothly to multidimensional
situations, but is more awkward for the development of maximum likelihood
tests and checking on goodness of fit. 

With the development of Mosteller’s goodness of fit test for the law of
comparative judgment [33], we now have a precise test for determining if 
the proportions predicted from scale values computed according to one of 
the comparative judgment laws agree reasonably well with the experimental 
proportions. In fact, in some cases the agreement of one set of observations 
with another set is better than that of either set with the law of comparative 
judgment. 

Gulliksen and Tukey have proposed some variance components pro- 
cedures for data collected by the law of comparative judgment [22]. The 
law of comparative judgment, however, is useful only in dealing with a set 
of stimuli where there would be a fair amount of disagreement, though there 
need not be confusion or disagreement between the two ends of the scale. 
Precise methods for determining scale values have been presented (Morrissey 
[29] and Gulliksen [18]) that apply to the situation where only a few, say 
four or five, adjacent stimuli would show a reasonable percentage of confusion 
in judgments. This, however, still means that if one wishes to investigate 
a very large scale, the paired comparison method would be rather laborious 
experimentally, both for the subjects and for the experimenter. 


Successive Intervals and the Law of Categorical Judgment 


Successive intervals and the law of categorical judgment provide a 
better way of studying a large set of stimuli. This method was generalized 
from the method of equal-appearing intervals. The subject is simply asked 
to place the stimuli in categories, each of which is larger than the one below 
it and smaller than the one above. It is assumed that the categories might 
well vary in width, with as many unknowns as there are categories of these 
unknown widths. Again, one must note that the law of categorical judgment 
is also applicable to the situation where any measurable physical charac- 
teristics of the stimuli are not correlated in any useful fashion with the 
subjective judgments. Analytical methods may be metric or nonmetric. 
Coombs and his associates have worked on the development of nonmetric 
scaling methods for linear and multidimensional domains [5, 8]. 

Various sorts of behavior can be used for determining scale values. 
So far we have discussed paired comparison judgments and categorical 
judgments. Shepard has utilized confusion errors in discrimination learning 
as an index of similarity to establish a scale ([38]; [21], ch. 4). 


Scaling Methods: A Law of Human Behavior 


In all of these cases we have simply a method of obtaining a measurement. 
The real objective of any science lies not solely in accurate measurement 
although it is essential that measurements be developed that satisfy a set 
of basic postulates. The next question to raise is, “Can these measurements 
be utilized to establish certain laws?” We will now consider developments 
in this area. 

It should be noted, first of all, that all of the material in psychophysics— 
the logarithmic law, the power law, the law of comparative or of categorical 
judgment, the deductions from Luce’s choice axiom, and the Bradley-Terry 
ratio law—are basically statements regarding the behavior of organisms in 
making a difficult decision, that is, a judgment so difficult 
that judgments in successive trials will not agree with each other. The laws 
we have been considering here specify a very precise and verifiable rule 
for behavior of the organism in such ambiguous situations. It is interesting 
to find that similar rules are followed by the nervous system in responding 
to such situations, regardless of whether they deal with brightness or hue 
of lights, pitch or loudness of sounds, beauty of pictures, value of objects, 
and so on. Experimental situations range from those evoking some extremely 
concrete, objective, verifiable judgments, where one can say that the person 
as a discriminating machine judged correctly or judged incorrectly, to the 
cases in which the human organism is the only machine capable of making 
the judgments in question, and verification can come only in terms of the 
repeatability of sets of judgments. The single judgments, in other words, 
are not repeated, but the groups of judgments show a certain consistent trend. 
Fullerton and Cattell pointed out this fact and stressed its importance [14]. 


Scaling to Establish Scientific Laws 


The scale values that are obtained by the various psychophysical pro- 
cedures constitute measurements which then can be utilized for establishing 
laws, laws which either relate (i) these measurements to other measurements, 
(ii) the measurements to behavior, or (iii) different sets of subjective measure- 
ments to each other. Some illustrations in each of these categories will be 
indicated. 

The establishment of such laws was the original interest of Weber and 
Fechner. They established functional relationships between the physical 
measurement of the stimulus and the subjective measurements. It is possible 
also to relate these subjective measurements to the behavior of the organism, 
as in studies by Thurstone [42] and Jones ([21], ch. 2) where choice of menu 
items has been predicted from the obtained scale values. Some subjective 
measurements, namely attitudes expressed in attitude scales, changed with 
influences such as lectures or movies [42]. Methods of influencing human 
behavior can be measured in this way. 


Scaling to Study Diminishing Returns 


It has also been found that these measurements can exhibit certain 
consistencies among themselves; for example, studies have been made using 
scale values for single objects and for double objects. The findings have 
been that a linear value law holds fairly well as long as one does not extend 
the scale too far. (See a study of birthday gifts by Thurstone and Jones 
[42], pp. 195-210.) 

It is possible, correspondingly, to state precise predictions regarding 
scale values of composites that would result from a square root, a logarithmic, 
or an exponential type of diminishing returns law. One test of these [17] 
has indicated that the negative exponential law of diminishing returns is 
in reasonable agreement with data, although more extensive work involving 
a wider range of values will have to be undertaken before one can be certain 
regarding the linear vs. some type of diminishing returns. 

The relation of adjectives and adverbs studied by Cliff [4] established 
the rule that “adverbs multiply adjectives,” and this has proved to be very 
widely applicable. Four foreign countries in which the schedule of adverb- 
adjective combinations has been administered show a striking verification of 
the initial rule. 

Various principles in economics, sociology, ethics, aesthetics, and 
linguistics could be investigated in a precise manner beginning with scaling 
procedures, but distinctly not stopping solely with scaling procedures. The 
analysis should search for uniformities that appear in these sets of scale 
values, or in the relations of these scale values to aspects of human behavior 
that are measured by other than subjective means. Osgood, Suci, and Tan- 
nenbaum [35] have used a type of scaling, the semantic differential, to study 
meanings of words in different groups and in different countries. Some rather 
striking similarities across cultures are being found by these studies. 

Runkel [37] has a very interesting study of perception of cognitive 
relations in a set of statements about psychology. Using Coombs’ nonmetric 
methods, he finds that students, whose structure of agreement and disa- 
greement with these statements is similar to the instructor’s, make higher 
grades. There is no relation to A.C.E. scores. Dember [11] has utilized some 
nonmetric scaling procedures in studying problems of perception. 


Multidimensional Scaling 


One of the post-Fechnerian developments of great interest is multi- 
dimensional scaling. In dealing with colors or with sounds, it soon became 
obvious that one needed to develop several different sorts of linear scales 
in order to state adequately the interrelationships of the system. Pure tones 
differed subjectively in loudness and in pitch and in corresponding correlated 
physical characteristics of intensity and frequency. 

Colors differ in hue, tint, and chroma. The usual approach has been 
to scale each dimension separately and then put the scales together, assuming 
orthogonality. In the multidimensional approach, by contrast, one would 
begin with only one basic concept, the concept of relative similarity and 
difference, and determine the number of dimensions implied by this informa- 
tion. From a common-sense point of view, we can see that the relative 
similarities and differences of a set of objects can imply different dimen- 
sionalities. For example, if we have, say, four points A, B, C, D and are given 
the information that the distance from A to B is 2, B to C is 2, 
C to D is 2, and that the AC distance is 4, the BD distance is 4, and the 
AD distance is 6, clearly then this set of measurements gives the information 
that the four points can be adequately represented by a single dimension 
insofar as this particular set of differences is concerned. 

A more complex case would be illustrated where, let us say, the AD 
distance is 8, the BC distance is 6, and all the other distances 5. If one tried 
to put such distances together, it can be seen that a two-dimensional figure 
(a rhombus) would be obtained. This is not a trivial distinction. To say that 
this is a two-dimensional system means that one must represent each of the 
objects by two different numbers, one number corresponding to the hori- 
zontal and the other corresponding to the vertical dimension, in order to 
represent the set of interrelations stated here. This is at a purely intuitive 
level and only with four objects. When one deals with a very large number of 
objects, the situation becomes considerably more complicated. The general 
question is, “Can one take a set of inter-object differences of this sort and 
from these distances determine in some reasonably convenient way the 
implied dimensionality?” Young and Householder [46] developed the basic 
theorems for the dimensionality of the set of points in terms of interpoint 
distances, showing that, if one took the bordered matrix of squared inter- 
point distances, the dimensionality of the set of points is two less than the 
rank of this particular matrix. 
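
The two four-point examples above can be checked with a short sketch of this rank rule. It assumes that "bordered" means adding a row and a column of ones with a zero in the corner, which is the usual form of the bordered matrix; the function names are hypothetical, and exact rational arithmetic is used so that the rank is not blurred by rounding.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Exact rank by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def implied_dimensionality(dist):
    """Rank of the bordered squared-distance matrix, minus two."""
    n = len(dist)
    bordered = [[0] + [1] * n]  # assumed border: zero corner, then ones
    for i in range(n):
        bordered.append([1] + [dist[i][j] ** 2 for j in range(n)])
    return matrix_rank(bordered) - 2

# Rows and columns are in the order A, B, C, D.
line = [[0, 2, 4, 6], [2, 0, 2, 4], [4, 2, 0, 2], [6, 4, 2, 0]]
rhombus = [[0, 5, 5, 8], [5, 0, 6, 5], [5, 6, 0, 5], [8, 5, 5, 0]]

print(implied_dimensionality(line))     # first example: one dimension
print(implied_dimensionality(rhombus))  # second example: two dimensions
```

With fallible judgments the bordered matrix will generally have full rank, so in practice the rank has to be judged approximately rather than exactly as here.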

From the psychological viewpoint the important thing to notice here 
is the simplicity of the basic data that are involved. One need not fret about 
communicating to the subject the difference between whiteness, saturation, 
and hue, or communicating the distinction between loudness and pitch. One 
simply asks: to what extent are these things we have here the same, to what 
extent are they different? Animal matching experiments have been con- 
ducted, so that one could quite readily communicate the basic idea necessary 
for multidimensional scaling to a rat or chimpanzee. Given the basic judg- 
ments regarding similarity, the mathematical procedure then takes over and 
determines how many dimensions are necessitated by this particular set 
of judgments. It seemed initially that the type of judgment required here 
might be so complicated as to become unwieldy, to become unduplicatable— 
let us say unreliable. 

The initial experiment by Richardson [36], dealing with colors of a single 
hue and different saturations, showed that subjects could make these judg- 
ments in a reasonably systematic fashion. By these procedures one obtained 
an independent verification of scale separations more or less agreeing with 
those of the Munsell system. A second study, conducted by Klingberg [25] 
under Richardson’s direction, dealt with the estimates of the probability of 
war among a certain set of sovereign states. Again it was found that the 
judgments could be made and that they were reasonably stable; the data 
showed the seven countries formed a four-dimensional system with respect 
to probabilities of war. 

Torgerson [44] conducted an investigation of a set of grays which, of 
course, could have been more than one-dimensional by this particular 
experimental and analytical system. The analysis showed a clear one-dimensional 
structure. A set of colors, reds of different saturations and brightnesses, also 
showed a two-dimensional structure in considerable agreement with the 
Munsell scale values. 

Helm [23] has investigated a slice of different hues through the color 
pyramid and again found that the multidimensional scaling routines work. 
They gave him essentially something like a section of the color pyramid. 
The judgments required by these experiments, although apparently rather 
difficult and unusual judgments, are nevertheless stable and duplicatable. 
In Helm’s particular case, he asked the subject to put the two chips which 
were most unlike in fixed positions on a plane and then to “put the other 
chip where it belongs.” In other words, if it were just as different from A 
as A was from B and just as different from B as B was from A, the chip would 
then come at the third apex of an equilateral triangle. It could vary from 
there to half way between, to either side. The subjects would insist they did 
not quite know what they were judging, did not know how to do this, but 
showed that they were repeating these judgments on a second occasion. 

Interestingly enough, far from being an inaccurate and unusable kind 
of judgment, this judgment was accurate and sensitive enough to detect 
individual differences in color vision and to detect, by the peculiarity of 
the resulting configuration of the hues, hitherto unsuspected weaknesses in 
color perception on the part of some of the subjects. A preliminary experi- 
ment by Donald Thomas dealing with odors has shown that such judgments 
can be made with respect to odors, and that on a small set of odors certain 
aspects of the Crocker-Henderson system are verified. 

Will these multidimensional methods work in more complex areas? 
Messick [27] asked his subjects to judge the probability that a “person who 
believed statement i would also believe statement j.” This was done 
systematically for a set of 21 statements, 7 from each of three scales. He found 
that the statements about capital punishment and about treatment of 
criminals constituted a single scale, while attitude toward war was a different 
scale at an angle of about ®0 degrees to the first one. These findings were 
essentially identical for a group of Air Force cadets and a group of theological 
seminary students. 

Morton [30] studied 15 members of each of two different fraternities 
with respect solely to friendship, asking the question, “Which one would 
you rather have as your friend?” He determined from these data that the 
friendship interrelations among the 15 people involved a four-dimensional 
space. This four-dimensional space, then, was matched with a number of 
linear scales—grades, athletic ability, intelligence, attitude toward girls, 
toward fraternities, and toward various things. It was found, for instance, 
that in one of the fraternities the major dimension determining friendship 
interrelations turned out to be a grades-intelligence complex. In the other 
fraternity, the major dimension turned out to be an athletic interest-athletic 
competence dimension. The first fraternity did, interestingly enough, have 
a reputation as a scholastic fraternity and the other as an athletic fraternity. 

It should be noted, I think, that these results are not necessarily obvious, 
and are not the only type of result that could be obtained. It might well 
have been, for example, that in a scholastic type fraternity, selection of 
members would have been so homogeneous that there would not have been 
the possibility for correlation on the basis of scholastic ability. Friendship 
lines would then follow other lines within the group. Correspondingly, if 
the athletic fraternity members had all been selected for high athletic ability, 
then possibly the friendship structure would depend on similarities with 
respect to intellectual interests or on similarities with respect to particular 
type of athletic interest. 

Enough has been said on multidimensional scaling to indicate that it 
is a rather powerful technique for investigating a wide array of situations. 
The basic experimental question is a very simple one. Despite a superficial 
appearance of difficulty and unreasonableness, one can get consistent answers 
and can come up with rather interesting conclusions—some of which verify 
the results of unidimensional scaling and others of which go beyond. 


Non-Euclidean Systems 


The studies that I have cited have dealt primarily with a Euclidean 
type approach to multidimensional scaling, making use basically of the 
principle that the square of the distance from A to B is the sum of the squares 
of the distances in the separate dimensions. Some other systems are proposed 
also for dealing with the multidimensional area. Landahl [24] has proposed 
and Attneave [1] has discussed a “city block model” in which the distance 
from A to B plus the distance from B to C constitutes the distance from 
A to C. If one thinks of city blocks and thinks of being limited to walking 
along certain pathways it is clear that this might be true even in a space that 
was more than one-dimensional. One would, of course, have to be careful 
about which pathways were considered permissible. 
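
The contrast between the two distance rules can be sketched numerically. The three points below are hypothetical and chosen only for illustration; B lies on a permissible path from A to C (no backtracking in either dimension), while B_off does not.

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared differences in each dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block(p, q):
    """Sum of the absolute differences in each dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical points in a two-dimensional space.
A, B, C = (0, 0), (1, 2), (3, 3)
B_off = (1, 5)  # overshoots C in the second dimension

# City-block distances add along the path A -> B -> C.
print(city_block(A, B) + city_block(B, C), city_block(A, C))

# Euclidean distances do not: the detour through B is longer.
print(euclidean(A, B) + euclidean(B, C), euclidean(A, C))

# Additivity fails for the city-block rule too once the path backtracks.
print(city_block(A, B_off) + city_block(B_off, C), city_block(A, C))
```

The last line illustrates the caution about permissible pathways: additivity holds in the city block model only when the intermediate point does not require backtracking in any dimension.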

Coombs [6] and some of his students have tried a nonmetric approach, 
a multidimensional unfolding technique, in which the rank order of distance 
of each object from each of the others gives the basic clue as to the dimen- 
sionality of the space. So far this approach has not been able to handle as 
many dimensions or stimuli conveniently as does the Euclidean type approach. 
Coombs [7] has used confusion of Morse code signals as an indicator of 
similarity and finds two dimensions in the subset of 10 signals studied. An 
unpublished study by Royal on facial expression of emotions obtained three 
dimensions, similar to Schlosberg’s analysis. We thus have quite a variety 
of approaches to multidimensional scaling. 


Individual Differences 


In addition to the extensions of the original Fechner unidimensional 
scaling methods to the areas already mentioned, I also wish to indicate 
another promising new line of development, which provides for individual 
differences in the system of scale values. One method of handling individual 
differences in scale values has been suggested by Coombs [5, 6]. The objects 
are regarded as one set of points (say i) and the judges as another set of 
points (say j). The distance from a specified j point to each of the objects is 
monotonically related to the ratings or scale values for that judge. The same 
is true for each of the other judges, giving the possibility of a great many 
different individual points of view for dealing with the set of objects. 

Coombs and Pruitt [9] have studied preferences for various types of 
bets and find interesting individual preferences for “long shots” or for “even 
bets.” Tucker [45] has suggested regarding the objects as a set of vectors 
(say i) and the judges as another set of vectors (say j). The product of a 
specified j vector with each of the i vectors gives the scale values of each 
of the objects, i, for that particular judge. The same is true for each of the 
other judges. Perhaps I can best illustrate this with a very simple set-up 
in which there are four objects: A, B, C, and D. Suppose Persons 1, 2, and 3 
assign the following ratings to these objects. 


                    Objects 
Person       A       B       C       D 
   1         2       1       3       4 
   2         2       4       1       3 
   3         2       3       4       1 
Mean         2      2⅔      2⅔      2⅔ 


If we follow the usual procedure of averaging ratings, then object A gets 
an average rating of 2. The other three get an average rank of 2⅔, so A is 
clearly the first on the average. Perhaps if A, B, C, and D are four people, 
and we have a political caucus situation where the party convention must 
agree on a candidate, we would find that these results are correct: Mr. A 
wins because none of the other three candidates can swing enough votes 
from those who rate them very low. So from one point of view the average 
scale value could be regarded as a correct representation of the results. 
However, from another point of view, if we are dealing with a market 
situation, A, B, C, D are commodities. On the basis of, shall we say, psychological 
research, one concern puts out commodity A as the best one. The others 
who put out B, C, and D would get all of the sales, however. The “second 
best” A would not get any sales since there would always be something 
different, B, C, or D, that would get the first choice, if available. 
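
The arithmetic behind the two readings of the table can be sketched as follows. The ratings are those given in the text (1 is the most preferred); the variable names are hypothetical.

```python
from fractions import Fraction

# Ratings from the table: person -> {object: rating}, 1 = most preferred.
ratings = {
    1: {"A": 2, "B": 1, "C": 3, "D": 4},
    2: {"A": 2, "B": 4, "C": 1, "D": 3},
    3: {"A": 2, "B": 3, "C": 4, "D": 1},
}
objects = ["A", "B", "C", "D"]

# Caucus view: average rating of each object across persons.
mean_rating = {
    o: Fraction(sum(r[o] for r in ratings.values()), len(ratings))
    for o in objects
}

# Market view: how many persons give each object their first choice.
first_choices = {
    o: sum(1 for r in ratings.values() if min(r, key=r.get) == o)
    for o in objects
}

print(mean_rating)    # A averages 2, the others 8/3
print(first_choices)  # A is nobody's first choice
```

Object A has the best average and yet is the first choice of no one, which is the distinction the caucus and market illustrations draw.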

Essentially this procedure is a factoring procedure. We might represent 
these four items by putting A in the center and B, C, and D at the vertices 
of an equilateral triangle. The different people concerned then would be 
thought of as vectors through this two-space. You could then see how each 
person’s point of view would be correctly represented by such a simple 
diagram (as in Figure 1). 

Figure 1 
Illustrative Preference Structure 


A simplified description of this procedure is that one obtains the scale 
values of the set of objects for each person. These scale values are arranged 
in an “objects” by “persons” matrix. The rank of this matrix indicates the 
number of dimensions in which the stimulus vectors and the person vectors 
would be placed in order to represent the experimentally observed 
relationships. To finish with a precise statement, the scale value for each object 
for each person would be the scalar (inner) product of the vector representing 
that person by the vector representing that particular object. In one study of 
preferences for a set of foods, Tucker [45] found a two-dimensional system 
with the major dichotomy being between berries vs. melons. A minor 
dimension separates the three different berries studied and the three different melons. 
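
A minimal sketch of the vector model applied to the four-object illustration. All coordinates are assumptions chosen for illustration, not values from the text: object A (rated 2 by every person) is placed at the origin, objects B, C, and D at the vertices of an equilateral triangle, and each person is a unit vector at a hypothetical angle. The scale value of an object for a person is taken as the inner product of the two vectors.

```python
import math

def unit(angle_deg):
    """Unit vector in the plane at the given angle in degrees."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

# Assumed configuration: A at the center, B, C, D on a triangle.
objects = {"A": (0.0, 0.0), "B": unit(90), "C": unit(210), "D": unit(330)}

# Hypothetical person vectors, 120 degrees apart.
persons = {1: unit(100), 2: unit(220), 3: unit(340)}

def scale_values(person_vec):
    """Inner product of the person vector with each object vector."""
    return {
        name: person_vec[0] * x + person_vec[1] * y
        for name, (x, y) in objects.items()
    }

def ranks(values):
    """Rank 1 goes to the object with the largest scale value."""
    order = sorted(values, key=values.get, reverse=True)
    return {name: pos + 1 for pos, name in enumerate(order)}

for person, vec in persons.items():
    print(person, ranks(scale_values(vec)))
```

With these assumed vectors, the rank orders obtained from the inner products reproduce the three rows of the ratings table, so two dimensions are enough to hold three quite different points of view.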

The vector model has been applied in linear scaling to show individual 
differences among people in preferences for wrist watches [45], for desserts 
([21], ch. 13), or for automobile body types. A study showing different schools 
of thought in rating prestige of occupations was reported earlier in these 
meetings (see [20]). The fact that different people may have different linear 
orderings of the same set of stimuli is represented in the vector model. 

We also have the problem of individual differences in multidimensional 
perceptions. So far I know of only two studies in this area. Helm studied 
individual differences in color perception. He obtained estimates of interpoint 
distances for each person. This matrix of pairs of points by persons was 
factored to determine the number of different “schools of thought” regarding 
interpoint distances. The matrix was rank three with the people falling along 
two sharply defined lines (Figure 2). One dimension represented varying 
degrees of color deficiency (Figure 3). The other represented a tendency 
to underestimate large distances vs. a tendency to underestimate small 
distances. 

Messick has also conducted some multidimensional studies of individual 
differences in perceived similarities among political figures, using questions 
of the type, “Is Roosevelt more like Hitler or McCarthy?” (Figure 4, Table 1). 
Republicans and Democrats as groups were not too different, but within 
these groups there were interesting multidimensional differences. Some people 
had a simple one-dimensional structure—from good to bad. Others had more 
complex two-dimensional or even five-dimensional structures. 

Hitherto it was necessary to predefine groups and see if they were 
similar or different. Now we can answer directly by factor analysis the 
question, “How many kinds of opinion are there?” 


FIGURE 2 
Factor Structure of Individuals: Color Interpoint Distance Data 


FIGURE 4 
Factor Structure of Individuals: Politics Interpoint Distance Data 


Computer Programs 


I should also note that the various methods presented here are computa- 
tionally quite laborious. However, with modern electronic computing ma- 
chines we have programs developed for dealing with the multidimensional 
analyses, for dealing with the vector model, for dealing with paired com- 
parisons and the law of comparative judgment, for the computations associ- 
ated with successive intervals and the law of categorical judgment, and also 
for the computations associated with the multiple rank-order short cuts, 
such as the balanced incomplete blocks and the balanced lattice. 


Conclusion 


The one hundred years since Fechner have been marked by extensions 
of the psychophysical methods (1) to deal systematically with domains in 
which there are no relevant physical measurements, (2) to the multidimen- 
sional domain, and (3) to the systematic description of individual differences 
among a group of people in one-dimensional perceptions. The appropriate 
statistical tests, analysis procedures, and some maximum likelihood 
estimates have been developed for some of these methods. 


TABLE 1 
Idealized Individual 

Political figures: Senator Douglas, Senator George, Senator Kefauver, 
Franklin D. Roosevelt, Adlai Stevenson, Governor Talmadge, Harry Truman, 
General MacArthur, Alger Hiss, Henry Wallace, Chiang Kai-shek, Adolf Hitler, 
Jawaharlal Nehru, Joseph Stalin, Thomas Dewey, Senator Dirksen, 
Dwight Eisenhower, Senator McCarthy, Richard Nixon, Senator Taft. 

In this space, there are, possibly, seven significant dimensions, only one 
of which approaches being as large as the major dimensions in the preceding 
spaces for idealized individuals a and b. The structure in these seven 
dimensions appears rather complex. 


Various experimental shortcuts which make the job easier for the subject 
have also been developed. The successive intervals method is one such illus- 
tration. The triads method described by Helm is another. The methods for 
handling incomplete data mean an increase in flexibility. Paired comparisons 
have also been simplified by the introduction of multiple rank-order designs, 
such as the balanced incomplete blocks and the balanced lattice. 

It is to be hoped that these methods will not be thought of as ends in 
themselves. That is, one does not establish scale values solely for the purpose 
of obtaining scale values, but for the purpose of describing the way judgments 
behave in certain complex situations in order to study the factors that will 
alter judgments, or to study interrelations among judgments as illustrated 
with the study of the various laws of diminishing returns and rules of lin- 
guistics. 

In the decades ahead, I would look forward to increasing use of these 
methods of subjective measurement to study various psychological laws— 
laws of psychological behavior—which could be applied in such fields as 
economics, sociology, and linguistics. 


The appearance of a statistical paper in a symposium honoring the 
founding of psychophysics was surely to have been expected. These two 
disciplines had their modern beginnings in about the same period of time 
and the well-known founders of each made at least some contributions in 
the other. Karl Pearson’s research [9] on the personal equation is still a 
good example of painstaking psychophysical experimentation. Indeed much 
of his statistical reputation rests on techniques he developed to analyze 
this data. Fifty years ago Urban [11, 12] published, in a psychological journal, 
the final form of the Müller-Urban weights. Twenty years later biologists [7] 
rediscovered Urban’s weights and now use them routinely. Spearman, too, 
contributed to both fields. 

For a number of reasons, some obvious, the paths of the statistician 
and the psychophysicist have diverged since those early days. I think the 
most important reason is that the gains to be realized from better control 
of the physical stimulus have overshadowed the possible pay-off from more 
sophisticated statistical design and analysis. I also think that this imbalance 
is fast disappearing. This preoccupation of psychophysicists, along with the 
expansion of statistical effort into so many other scientific areas, has led 
to a breakdown in communication. It is not that psychophysicists have 
ceased to have statistical problems or that statisticians have ceased to produce 
methods potentially useful to psychophysicists. The statistician cannot be 
intrigued by a psychological problem if he doesn’t hear of it; the psycho- 
physicist cannot apply a statistical technique if it appears in a journal he 
never sees. 

I hope today to point out at least one area of statistical research which 
I feel is likely to prove useful in psychophysics and which could be advanced 
by a little interest from the psychologist. For the purpose of this paper I 
will restrict my discussion to those psychophysical experiments in which 
the response of the subject to the stimulus is limited to a small, discrete 
number of possibilities, i.e., I am leaving out such measures as the force of 
a knee jerk, response times, or the like. It should be noted that this does 
not rule out the method of adjustment in which the interpretation could 
be made that the response is binary, yes or no, but is continuous in time. 


*Operated with support from the United States Army, Navy, and Air Force. 

I shall also wish to assume that there is a psychometric function the 
existence of which is independent of ways of measuring it although its mani- 
festations are surely not. This means that given a stimulus, a set of responses 
defined by a set of instructions, a subject, and an occasion on which they meet, 
there is a unique probability distribution over the set of possible responses. 
Further there exists a priori a scale of measurement of the stimulus which 
is monotonically related to that probability. This last makes, for me, the dis- 
tinction between psychophysics and psychological scaling. 

My final presupposition is perhaps controversial. This is that the experi- 
menter has some definite purpose in mind. There is a well-defined question 
he wants to answer. He may want to know what the median of the psycho- 
metric function is, or the parameters of the best-fitting normal ogive, or 
some set of percentiles. “What is the shape of the function?” is not such a 
question! The answer requires knowledge of the values of the function at 
an indefinitely large number of stimulus energies. This is unobtainable without 
an indefinitely large number of observations. It’s another thing to ask “Is 
it more like an integrated normal or an integrated rectangular distribution?” 
This question is clear and points to definite portions of the function which 
need examination. 

Granting this presupposition it is clear, given a definite question, that 
some methods are going to be more efficient in answering it than others. 
Further, it is quite unlikely that the same methods will prove most efficient 
for answering all such questions. From this point of view I would like to 
discuss some of the methods currently in use in psychophysical research, 
some methods from other related disciplines such as biological assay and 
explosives testing, and some generalizations of these so far unsullied by any 
application whatever. 

I will be mainly concerned with the problem of stimulus programming 
and its use in obtaining answers of known precision to definite questions. 
I will speak in terms of a “yes-no” indicator response but I hope it will be 
clear that the problems and proposals I have to make are not limited to 
such response classes. With only minor changes, the entire discussion could 
be repeated for the “forced-choice” or any other technique yielding quantal 
responses. 


Stimulus Programming 


By stimulus programming I mean the scheme by which the experimenter 
determines which stimulus is to be presented to the subject next. This may 
be determined independently of the subject’s responses as in the method of 
constant stimuli, or it may depend to any practical degree of complexity on 
the whole history of the experiment up till now. 

In the simplest form of the ascending method of limits, for example,
if the previous response was a “no” the next more intense stimulus is
presented; if a “yes” the next stimulus is the old starting stimulus. Or it may
be chosen randomly or haphazardly from a set of possible stimuli. Here the
choice depends only on the previous response. A more complex method of
limits, say one requiring two positive responses to end a run, may require
knowledge of the whole history of that run. In a third method, the “up-and-down”
method [1, 5], a “no” response causes the next more intense stimulus
to be presented, a “yes” response causes the next less intense. Now for what
definite questions might these methods be appropriate? We seem to have 
put the cart before the horse, but then we had the cart first. 

We can avoid technical discussion here if we agree that the best way 
to gain knowledge about a particular point on the psychometric function is 
to take a lot of observations as near that point as possible. Suppose we 
are interested in the median of the function, that stimulus which elicits a 
positive response 50 percent of the time. The efficiency of the method of 
constant stimuli for answering this question depends of course on the choice 
of stimuli. Suppose that five equally spaced stimuli are used. If the experi- 
menter is both careful and lucky, four of these will be in the uncertain region 
and one or maybe two will be near the median; the latter will contribute 
almost all of the information he gets. He will have used about 60 percent of 
his effort just getting the other 40 percent near the point of interest. 

With the up-and-down method, most of the observations will be con- 
centrated at the two stimuli bracketing the median and only a few observa- 
tions used in getting there. The ascending method of limits is ideally suited 
not to get observations near the median. Even a two-sided method of limits 
will concentrate observations near the upper and lower percentage points, 
and away from the median. If one really is interested in the median alone 
the up-and-down method beats the others easily, and if we go even further
and use infinitesimal step sizes we reach the Békésy audiometer technique [2] 
with continuously variable amplitude. At least on a common-sense level this 
should yield the best possible median measure. Unfortunately the statisticians 
have provided no rational analysis of the instrument so we cannot be sure. 
This is surely an unsolved statistical problem which merits attention. 
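
The concentration of observations around the median can be checked with a small simulation. The sketch below is only illustrative: the logistic psychometric function, its median at level 5, the eleven stimulus levels, and the names `p_yes` and `up_and_down` are all assumptions made for the example, not part of the method as described.

```python
import math
import random
from collections import Counter

def p_yes(level, median=5.0, slope=1.0):
    """Hypothetical logistic psychometric function (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-slope * (level - median)))

def up_and_down(n_trials, n_levels=11, start=0, seed=0):
    """Run the up-and-down rule and count how often each level is presented."""
    rng = random.Random(seed)
    level = start
    counts = Counter()
    for _ in range(n_trials):
        counts[level] += 1
        said_yes = rng.random() < p_yes(level)
        # "yes" -> next less intense stimulus; "no" -> next more intense one
        step = -1 if said_yes else 1
        level = min(max(level + step, 0), n_levels - 1)
    return counts

counts = up_and_down(5000)
# Share of all presentations falling within one step of the assumed median
near_median = sum(counts[k] for k in (4, 5, 6)) / 5000
print(sorted(counts.items()), round(near_median, 2))
```

Under these assumptions the great majority of presentations fall within one step of the median, and only the first few trials are spent getting there from the starting level.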

If on the other hand the question of interest concerns the 10th percentile 
of the psychometric function, neither the up-and-down method nor the 
Békésy method is likely to be of much use. The method of constant stimuli 
will, with luck, do some good. The ascending method of limits ought to be 
very accurate. The efficiency of a method depends upon the question asked. The 
only general statement seems to be that, for answering a particular question, 
a method which ignores data gathered so far in the experiment is not likely 
to be more efficient than a method which uses it. 

Evidently the stimulus programs mentioned so far do not exhaust the 
possibilities for systematically modifying the experiment while it is in progress. 
As a matter of fact the possibilities are staggering. What is needed is a way 
of weeding out some of them on extra-statistical grounds. Any method
requiring extensive analysis of preceding data before another stimulus is
presented is just not practical in psychophysical measurement. What is 
meant by “extensive analysis” will depend on the experimental resources
available, whether the experiment is being run automatically or by hand, 
and other considerations of this sort, but I would like to propose for con- 
sideration a class of designs, called Markov designs, which I feel would be 
simple to run in many circumstances. The specifications for a Markov design
are:


(1) a discrete set of stimulus values the experimenter is prepared to use;

(2) a rule stating, as a function of the stimulus and the outcome of
the preceding trial only, the probability that each of the possible
stimulus values will be used next; and

(3) a convenient random device for choosing the next stimulus
according to that probability.


For example one might modify the up-and-down method by decreasing the 
intensity of the stimulus by three steps for each detection, still increasing 
it by one step after a miss. This would tend to concentrate observations in 
the neighborhood of the 25th percentile. 
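
The three-down, one-up example can be examined with elementary Markov chain theory. The following is a minimal sketch under stated assumptions: a hypothetical logistic psychometric function with its median at level 10, 21 stimulus levels, and clamping at the ends of the range. The function names are illustrative only. At equilibrium the average step must be zero, so the overall proportion of detections p satisfies 3p = 1 - p, giving p = 0.25.

```python
import math

def p_yes(level, median=10.0, slope=1.0):
    """Hypothetical logistic psychometric function (assumed for illustration)."""
    return 1.0 / (1.0 + math.exp(-slope * (level - median)))

def transition_matrix(n_levels, down_on_yes, up_on_no):
    """P[x][y] = probability that level y follows level x under the design."""
    P = [[0.0] * n_levels for _ in range(n_levels)]
    for x in range(n_levels):
        lower = max(x - down_on_yes, 0)             # clamp at the weakest stimulus
        upper = min(x + up_on_no, n_levels - 1)     # clamp at the strongest stimulus
        P[x][lower] += p_yes(x)
        P[x][upper] += 1.0 - p_yes(x)
    return P

def stationary(P, iterations=3000):
    """Approximate the stationary distribution by repeated multiplication."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        # "Lazy" update pi <- pi (P + I) / 2: same stationary distribution,
        # but no oscillation from the nearly periodic structure of the chain.
        nxt = [0.5 * v for v in pi]
        for i, row in enumerate(P):
            for j, pij in enumerate(row):
                if pij:
                    nxt[j] += 0.5 * pi[i] * pij
        pi = nxt
    return pi

P = transition_matrix(n_levels=21, down_on_yes=3, up_on_no=1)
pi = stationary(P)
# Long-run proportion of detections across all presented stimuli
yes_rate = sum(w * p_yes(x) for x, w in enumerate(pi))
print(round(yes_rate, 3))
```

Changing `down_on_yes` and `up_on_no` moves the point of concentration, which is one way a design might be tailored to a percentile other than the median.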

A little reflection shows that except for the Békésy technique all the 
preceding methods fall in this class. The method of constant stimuli is in
the class vacuously, the others in a more meaningful way. The advantages
of this class, besides experimental convenience, are (i) it provides a very 
wide choice while, at the same time, (ii) the properties of any member of 
the class are easily investigated using the theory of Markov chains, and 
(iii) the analysis of the resulting data from any of them is no more difficult than 
the efficient analysis of data produced by the method of constant stimuli, 
and is sometimes easier. 

Another class of methods are those which require adjustment on a 
continuous scale of the stimulus values. The Békésy method and the method 
of adjustment are two familiar members. These both require continuous 
response, and as yet no one has provided more than rule-of-thumb methods 
for analyzing the data. The Robbins-Monro [10] stochastic approximation
method shows much promise, but its small-sample properties have not been
investigated. As applied to psychophysical responses it is similar to the
“up-and-down” method except that the step size is slowly decreased as the
experiment proceeds. It is readily adaptable to the investigation of any 
percentile of the psychometric function, and, it seems to me, deserves some 
attention in our laboratories. 
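
A minimal sketch of a stochastic approximation run for the 25th percentile follows. The logistic psychometric function, the starting stimulus, the gain constant, and the 1/n step schedule are assumptions chosen for illustration; the names `p_yes` and `robbins_monro` are hypothetical.

```python
import math
import random

def p_yes(x, median=0.0, slope=1.0):
    """Hypothetical logistic psychometric function (assumed for illustration)."""
    return 1.0 / (1.0 + math.exp(-slope * (x - median)))

def robbins_monro(target, n_trials, x0=3.0, gain=4.0, seed=1):
    """Track the stimulus value at which P(yes) equals `target`."""
    rng = random.Random(seed)
    x = x0
    for n in range(1, n_trials + 1):
        y = 1.0 if rng.random() < p_yes(x) else 0.0
        # Move down after a "yes", up after a "no"; the step shrinks like 1/n.
        x -= (gain / n) * (y - target)
    return x

estimate = robbins_monro(target=0.25, n_trials=5000)
true_value = math.log(0.25 / 0.75)   # 25th percentile of the assumed logistic
print(round(estimate, 2), round(true_value, 2))
```

Setting `target=0.5` turns the same loop into a median tracker, which is the sense in which the procedure adapts to any percentile.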

Now there is a psychological objection to all these methods. How does 
all this stimulus programming affect the response once we have finally decided 
what stimulus to use? What about errors of habituation and anticipation? 
These questions have two bases. The first is the effect of immediately previous
experience on the subject’s sensitivity; the second is the effect of the subject’s 
knowledge of the stimulating conditions on other than the relevant sensory 
basis. I would suggest that the first basis of criticism is a fallacious one. To 
the extent that immediately previous experience does affect the subject’s 
sensitivity, this phenomenon itself is a legitimate, even imperative, object 
of study which requires experimental methods to expose it rather than 
bury it in residual variance. The second basis of criticism is harder to answer. 
It is true that the Markov designs vary widely in the extent to which the 
subject can have such knowledge, but unfortunately it appears likely that 
the advantages of the method will decrease with any decrease in the linkage 
between the choice of the current stimulus and the accumulated data. One 
possible solution may be the interleaving of several Markov series. At any 
rate the gains in efficiency possible with these methods are so important that 
some effort should be spent in overcoming the problem. This is particularly 
true when the psychometric function is changing rapidly in time. In adapta- 
tion studies a 50 percent reduction in measurement time can make an im- 
portant theoretical contribution. I should mention in this respect the studies 
of Blough [4] in dark adaptation in the pigeon and Mitchell and Liandansky [8] 
in human dark adaptation.


Statistical Analysis and Inference 


Nothing that has been said so far depends in any way on the assumed 
shape of the psychometric function. If the experimenter has carefully
concentrated his observations in the regions of the curve he cares about, such
assumptions merely get in his way. The most pertinent data about any 
percentile of the psychometric function are the set of observations yielding 
that percentage. Only when there is a scarcity of pertinent data are such 
extrapolating devices needed. 

The point of graduating a set of data with a mathematical function is 
never to ascribe “reality” to that function, but rather to summarize the
data economically with a reasonable curve. By summarize I do not mean
merely a point estimate. At the very least I mean a point estimate and a
valid estimate of its error. The current use of an eye-fit to estimate psycho- 
metric functions does yield point estimates but how accurate are they? 
Given, through careful stimulus programming, that the data are relevant, 
the importance of data-fitting is mainly in providing such error estimates. 
One solution would be, of course, to make such estimates nonparametrically. 
I would guess, however, that fitting a specific function, even if it were a 
little wrong, would be more efficient in the long run. 

The function most widely used for such data graduation today is the
normal ogive, Urban’s phi-gamma hypothesis. The methods for fitting this 
curve have been carefully worked out and are available for use by any
statistical clerk (or graduate assistant) in Finney’s excellent Probit Analysis 
[6]. Estimates of any percentile and its standard error can be computed— 
estimates which depend on the distribution of stimuli sampled. Although 
the analysis is developed for constant stimuli, it can be directly applied to 
data gathered by any of the Markov methods, unless there are very few 
observations. Even then the modifications are simple. 

Another function which is receiving much attention is the logistic [3]. 
As far as fitting data is concerned, it is almost indistinguishable from the 
normal ogive and has in addition certain desirable statistical features. It is 
a two-parameter curve like the normal ogive, but more important there are 
two statistics easily computed from the data which summarize all information 
relevant to the maximum likelihood estimate of the parameters. In practical 
terms this means that much of the tedious computation of probit analysis 
can be bypassed by the production of one double-entry table. 
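
To make the point about the two summarizing statistics concrete, here is a sketch assuming the parameterization p(x) = 1 / (1 + exp(-(a + b x))). Under that assumption the log likelihood depends on the responses only through the total number of “yes” responses and the sum of the stimulus values on “yes” trials. The function name, the Newton iteration, and the counts are illustrative choices, not taken from the cited work.

```python
import math

def fit_logistic(levels, n_per_level, t0, t1, iterations=25):
    """ML fit of p(x) = 1 / (1 + exp(-(a + b*x))) by Newton's method.

    Besides the design (levels and trials per level), only two statistics
    enter: t0 = total "yes" count, t1 = sum of x over "yes" trials.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0, g1 = t0, t1                  # gradient of the log likelihood
        h00 = h01 = h11 = 0.0            # information matrix entries
        for x, n in zip(levels, n_per_level):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            g0 -= n * p
            g1 -= n * x * p
            w = n * p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical data: five stimulus levels, 100 trials each
levels = [-2.0, -1.0, 0.0, 1.0, 2.0]
n_per_level = [100, 100, 100, 100, 100]
yes_counts = [13, 33, 62, 85, 95]
t0 = sum(yes_counts)
t1 = sum(x * r for x, r in zip(levels, yes_counts))
a_hat, b_hat = fit_logistic(levels, n_per_level, t0, t1)
print(round(a_hat, 3), round(b_hat, 3), "median estimate:", round(-a_hat / b_hat, 3))
```

Any two data sets gathered with the same design and sharing the same `t0` and `t1` give the same estimates, which is why a single table indexed by the two statistics could replace the iterative arithmetic.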

Still other functions have been used, but the main point is that if the 
stimulus program has been effective, the choice is a minor one which quite
properly can be made on the basis of mathematical convenience. 


Summary and Conclusion 


The major thesis of this paper has been that one should choose a psycho- 
physical method appropriate to the question which prompted the inquiry. 
It has been suggested that the method of constant stimuli, although not 
inappropriate for most questions, is by no means optimal for any particular 
question. A class of stimulus programming techniques called Markov designs 
has been briefly described. These have the principal advantage of using data 
as it is gathered to improve the selection of stimuli for the remainder of the 
experiment. This improvement is achieved by concentrating observations 
in the region of most interest. They are also reasonably easy to carry out 
and can be tailored to a wide range of experimental objectives, and pose no 
new problems in data analysis. 

Finally data analysis is discussed. It is pointed out that the question of 
which mathematical form to use in graduating the data is a minor one if the 
data are efficiently gathered and that, within reason, mathematical con- 
venience is a good criterion. An advantage of the logistic function over the 
normal ogive is mentioned. 

Few of the ideas advanced here are new but many have not been ap- 
preciated fully by psychophysicists. The emphasis in this paper has been on 
the areas of sensitivity testing in which psychophysics has lagged behind best 
statistical practice, although it was pointed out that no efficient methods of 
analysis of the continuous response procedures have yet been devised by the 
statisticians. Better communication in both directions is clearly desirable. 

The occasion of a symposium to honor Fechner on the centennial of
the publication of his Elemente der Psychophysik [1] has provided an excuse
to review the Fechnerian legacy and to ask what perversity in the nature
of science or the Zeitgeist has made it possible for Fechner’s concept of error
to persist for a hundred years and grow famous. The central feature of this
legacy is the notion that the jnd (just noticeable difference) can serve as a
unit for the measurement of sensation. Since the jnd is really a measure of
the uncertainty, dispersion, or variability among a set of judgments, this
proposition makes residual “noise” the yardstick of sensory magnitude, a
notion as curious as it seems improbable. Why did this conception survive
its inventor, and why did it become, under Thurstone’s expert guidance,
the foundation of psychological measurement applied to subjective values
and attitudes? Thurstone’s Case V, in which he assumes equal discriminal
dispersions (equal subjective variabilities), matches in effect the Fechnerian
assumption that all jnds are subjectively equal.

Fechner and Thurstone both labored in the same tradition of psycho- 
physics. Practically all procedures that try to make scale units out of experi- 
mental variability stem from the efforts of these two pioneers. Fechner 
studied sensation and perfected methods for the determination of the jnd, 
but he also studied esthetic judgment by paired comparisons. Thurstone 
improved the machinery for handling the data of paired comparisons and 
proceeded “to adopt and extend the psychophysical methods to interesting
stimuli” like attitudes and preferences ([22], p. 213). Whatever the subject


development is the notion that the units of a psychological scale can be 
fabricated from observations on variability, by means of a systematic analysis 
of one kind or another. Take away variability and there remains no measure- 
ment. 

Since Fechner was concerned mainly with stimuli that he could measure 
on physical scales, we can compress his model into two assertions concerning 
what happens when the resolving power of a human observer is tested. 


1. The observer's variability is proportional to the stimulus magnitude.
(This is Weber’s law.)
2. To each unit amount of variability (a jnd), there corresponds a
constant psychological distance.


From these two assumptions Fechner deduced his erroneous “law” that the
psychological magnitude grows as the logarithm of the stimulus magnitude. 
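
The deduction from the two assumptions can be written out in a few lines. The symbols are assumptions introduced here for illustration: $\varphi$ for stimulus magnitude, $\psi$ for psychological magnitude, $k$ for the Weber fraction, $c$ for the psychological size of one jnd, and $\varphi_0$ for the stimulus at which $\psi$ is taken to be zero. Treating jnds as differentials is the step usually attributed to Fechner.

```latex
\begin{align*}
\Delta\varphi &= k\,\varphi
  && \text{(assumption 1: Weber's law)} \\
\Delta\psi &= c
  && \text{(assumption 2: every jnd is the same psychological distance)} \\
\frac{d\psi}{d\varphi} &\approx \frac{\Delta\psi}{\Delta\varphi} = \frac{c}{k\,\varphi}
  && \text{(jnds treated as differentials)} \\
\psi &= \frac{c}{k}\,\ln\frac{\varphi}{\varphi_0}
  && \text{(integrating from } \varphi_0 \text{, where } \psi = 0\text{)}
\end{align*}
```

The logarithmic form follows only if both assumptions hold; a failure of the second assumption is enough to undo it even where Weber's law is satisfied.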

Thurstone dealt with stimuli for which the physical scale was often 
only nominal or ordinal, but he too stated a “law.” This is the so-called law
of comparative judgment, a relation between the distance on a psychological 
continuum corresponding to two stimuli and the variabilities of the subjective 
impressions produced by the stimuli. It can be written 


$$y_1 - y_2 = z_{12}\sqrt{\sigma_1^2 + \sigma_2^2 - 2r\sigma_1\sigma_2}$$


where
$y_1$ and $y_2$ are the subjective values produced by the two stimuli,
$\sigma_1^2$ and $\sigma_2^2$ are the variances of the subjective values produced by the
two stimuli,
$r$ is the correlation between judgments for the two stimuli,
$z_{12}$ is the normal deviate corresponding to the proportion of times
one stimulus is judged to rank higher than the other.


Thurstone sometimes called it the equation of comparative judgment, 
a more fitting and appropriate name, because the equation is not a law in 
the ordinary scientific sense. It is a model, so to speak, of a very general
sort. It becomes a usable, testable equation only after assumptions have
been made about how subjective variability (discriminal dispersion) does 
in fact behave. The set of assumptions most often invoked in practice is 
called Case V: all the discriminal dispersions are assumed to be constant, 
which is essentially equivalent to the second of Fechner’s two assumptions. 
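
To show what Case V amounts to in practice, the sketch below converts a matrix of paired-comparison proportions into scale values. It assumes constant dispersions and a constant correlation, so that every difference in scale value equals the corresponding normal deviate times one common constant, which is taken here as the unit. Averaging the deviates in each row is one common way of solving the resulting equations; it is offered as an illustration, not as the procedure of any particular author. The function name and the proportions are hypothetical.

```python
from statistics import NormalDist

def case_v_scale(prop):
    """Scale values from prop[i][j] = proportion of times i is judged higher than j.

    Assumes equal discriminal dispersions (Case V); the unit of the scale is
    the common standard deviation of the difference between two impressions.
    """
    nd = NormalDist()
    n = len(prop)
    scale = []
    for i in range(n):
        # inv_cdf raises an error for proportions of exactly 0 or 1:
        # stimuli that are never confused yield no usable normal deviate.
        z_row = [nd.inv_cdf(prop[i][j]) for j in range(n) if j != i]
        # The deviate of a stimulus against itself is 0, so divide by n.
        scale.append(sum(z_row) / n)
    return scale

# Hypothetical proportions for three stimuli
prop = [
    [0.50, 0.31, 0.12],
    [0.69, 0.50, 0.24],
    [0.88, 0.76, 0.50],
]
scale = case_v_scale(prop)
print([round(s, 3) for s in scale])
```

The resulting values are centered on zero, since only differences between stimuli are determined by the data.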

This, then, is the heart of what Thurstone called the modern psycho- 
physics. The equation of comparative judgment is supposed to provide a 
means of measuring distances along the psychological continuum regardless 
of the nature of the stimulus scale. Obviously, however, the stimuli must
be close enough together so that they are occasionally confused, for otherwise 
no values of z can be stated. This requirement that the procedures must 
generate sufficient confusion makes it almost prohibitive to try to explore 
the full reach of an extended continuum like loudness without invoking other 
elaborations of the model—elaborations that are sometimes grouped under 
the questionable heading, “law of categorical judgment” [23]. (Again one
notes a rather loose use of the word law, plus a misleading adjective.) These
methods also call for the processing of variabilities. In the method sometimes
called successive intervals, for example, the idea is that when an observer
sorts stimuli into categories he will exhibit errors and confusions;
and, from assumptions designed to relate these confusions to dispersions on
the underlying psychological continuum, distances are established along the
subjective scale. Authors have christened this scale by a variety of names,
none of them very transparent in meaning. My proposal would be to call 
it the Category Dispersion Scale. 

There have been many sensible people to whom these procedures have 
seemed eminently sensible. Perhaps it is only from the perspective created 
by the newer techniques for measuring sensory magnitudes that the matter 
appears otherwise. From this point of view an array of dubious assumptions 
about jnds and discriminal dispersions appears to stand as an impediment 
to the straightforward prosecution of psychological measurement. Be that 
as it may, on those sensory continua where the assumptions of Fechner and 
Thurstone can be put to test—where subjective scales have been erected 
against which the distributions of subjective dispersions can in fact be dis- 
played—it turns out that no a priori rule for the processing of variabilities 
will guarantee the production of an interval scale of subjective magnitude.
(Starting from scales, we can determine error distributions, but starting from
assumed error distributions we cannot establish scales.)

Why then have those methods that approach the problem back-end- 
forward succeeded in generating so distinguished a following and such an 
ample literature? A fuller attempt to answer this question was undertaken
elsewhere [18], but the following are some of the points presented at the 
symposium. 


Excuses for Processing Variabilities 


Six conjectures can be offered to explain the survival of the methods 
that try to create scales of measurement by “unitizing” variability in one
form or another. Whether these reasons can be certified as historically valid 
is less germane to our present concern than is the ability of these six points 
to illuminate the present state of psychological measurement. They help 
to explain why a new “psychophysical law” is called for [11].


1. No competition 


The creative industry displayed by Fechner, plus the flair for moderni- 
zation displayed by Thurstone, made the processing of variability the princi- 
pal method of psychological scaling. Plateau invented the method of bisection, 
to be sure, but he made no capital of it. At one stage he even suggested that 
sensation grows as a power function of intensity, but this correct insight he 
later disavowed. He seems to have been deceived by the form of the bisection 
scale erected by his friend Delboeuf. 

Only since the 1930’s has there been a newly oriented attack on the 
growth of sensation, this time with direct ratio-scaling procedures [13]. 
Only since the mid 1950’s have these methods been extended to almost all 
sensory modalities. The new results make it abundantly plain—on some two 
dozen sensory continua—that sensory intensity grows as a power function 
of physical intensity [16]. A list of some of the continua explored and the
exponents obtained is presented in Table 1.


TABLE 1

Representative Exponents of the Power Functions Relating Psychological
Magnitude to Stimulus Magnitude on Prothetic Continua

Continuum            Exponent   Conditions

Loudness                        Binaural
Loudness                        Monaural
Brightness                      5° target, dark-adapted eye
Brightness                      Point source, dark-adapted eye
Lightness                       Reflectance of gray papers
Smell                           Coffee odor
Smell                           Heptane
Taste                           Saccharine
Taste                           Sucrose
Taste                           Salt
Temperature                     Cold, on arm
Temperature                     Warm, on arm
Vibration                       60 c.p.s., on finger
Vibration                       250 c.p.s., on finger
Duration                        White noise stimulus
Repetition rate                 Light, sound, touch, and shocks
Finger span                     Thickness of wood blocks
Pressure on palm                Static force on skin
Heaviness                       Lifted weights
Force of handgrip               Precision hand dynamometer
Vocal effort                    Sound pressure of vocalization
Electric shock                  60 c.p.s. through fingers
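
An exponent of this kind is commonly estimated as the slope of a straight line in log-log coordinates. The sketch below assumes the power function is written as psi = k * phi ** n, a notation introduced here for illustration, and uses made-up data rather than any values from Table 1. The function name is hypothetical.

```python
import math

def fit_power_law(stimulus, magnitude):
    """Least-squares line through (log stimulus, log magnitude); returns (k, n)."""
    xs = [math.log(s) for s in stimulus]
    ys = [math.log(m) for m in magnitude]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    n = sxy / sxx                 # slope in log-log coordinates = exponent
    k = math.exp(my - n * mx)     # intercept gives the scale constant
    return k, n

# Made-up data generated from an exact power function with exponent 0.6
stimulus = [1.0, 10.0, 100.0, 1000.0]
magnitude = [2.0 * s ** 0.6 for s in stimulus]
k, n = fit_power_law(stimulus, magnitude)
print(round(k, 3), round(n, 3))
```

With real magnitude estimations, the `magnitude` values would typically be geometric means across observers, one per stimulus level.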

It is interesting that Plateau, while advocating the bisection procedure, 
explicitly denied the possibility of direct ratio judgments of sensation—the 
kind of judgments on which Table 1 is based. Nevertheless, if Plateau had 
carried on with his method of bisection, he could probably have demonstrated 
quite convincingly, more than a hundred years ago, that the power law is 
the correct form, but the built-in biases in the method of bisection would 
presumably have prevented his obtaining the correct exponents. Bisection
produces one variety of what we call a partition scale, some of whose properties
are discussed below. The biases in these scales are such that, when the
exponent of the magnitude function is estimated from bisection experiments,
the estimate is systematically too low [20]. Still, if Plateau had stayed with
the problem, psychological measurement might have been spared its
time-consuming detour into the realm of dispersion models.


2. The universality of variability 


The unitizing of noise (the transmuting of scatter into scale units) has
the questionable advantage that we seldom lack the experimental variability
needed to create the units of the scale. Fechner could always find a jnd
(meaning a finite resolving power) with which to “measure” on a sensory
continuum, and Thurstone could always find dispersions among people’s
expressions of opinion. Actually, however, scaling by some of Thurstone’s
procedures may sometimes require more variability than the experiment
happens to provide. I have seen this occur, for example, when the method of 
successive intervals was applied to the sensory continuum loudness. 

One liability borne by the methods that attempt to transform confusion 
and uncertainty into units of measurement is their seeming to devalue the 
premium normally placed upon precision and the elimination of variability. 


3. Wide applicability 

Ever-present variability means that experimenters armed with “dispersion
methods” can address themselves to almost any subject matter.
This is advantageous, no doubt, but certainly not decisive. The powerful
procedures of fundamental measurement in physics do not lose their
importance merely because they can be applied only to some half dozen continua.

It should be remarked that certain of the modern ratio-scaling techniques, 
such as magnitude estimation [13], can also be used to scale stimuli for which 
there exists no underlying metric more advanced than a nominal scale. For
example, two of my students have recently measured the subjective roughness
of sandpapers by magnitude estimation, and have gone on to show that
this subjective continuum probably belongs to the prothetic class by
demonstrating that the “partition scale” erected by means of category judgments
is nonlinearly related to the ratio scale of subjective roughness.

For the magnitude scale, Judith Rich and Irma Silverman presented 
12 grades of sandpaper (nominally 24 to 320 grit) to the observer who made 
two sweeps with his first and second fingers over the paper. The papers 
were presented twice each in a different irregular order to each of 12 observers. 
A paper of medium roughness was presented first and called 10. The observer 
was instructed to assign numbers proportional to the apparent roughness, 
using any numbers he deemed appropriate: whole numbers, decimals, or 
fractions.

For the partition scale the observer tried to divide the continuum into
seven equal-appearing intervals by assigning the numbers 1 to 7 to the
apparent roughness. At the outset, the smoothest paper was presented and
called 1; and then the roughest paper was presented and called 7. Each of
12 observers judged each paper twice in irregular order.

As shown in Fig. 1, the partition scale that results when observers
try to divide roughness into seven equally spaced categories is nonlinearly 
related to the scale erected by direct magnitude estimation. Try as he may, 
the observer produces categories that are not equally spaced on the subjective 
continuum. This outcome is the standard, invariant finding on prothetic
continua [19].


FIGURE 1

Scales of the apparent roughness of sandpaper. The abscissa is the scale
derived from the geometric means of the magnitude estimations. The small
arrows mark the locations of the sandpaper stimuli on the linear scale of
subjective roughness. The circles are the arithmetic means of the category
judgments (ordinate). The curvature is typical of that found in partition
scales on prothetic continua.

The precise curvature of the partition or category scale depends on the 
spacing of the stimuli—an effect that can be neutralized by an iterative 
experimental procedure [19]. The iterated, pure form of the partition scale 
would be somewhat less curved than the category scale shown in Fig. 1. 
On the other hand, when the confusions between the categories are used 
to generate a scale (Thurstone’s method of successive intervals), the result 
is generally more curved than the scale in Fig. 1. This category dispersion scale 
approximates a logarithmic form, an outcome that merely reflects the wide- 
spread tendency of variability to increase roughly in direct proportion to 
magnitude. 

Why does the observer fail when he tries to partition a segment of a 
prothetic continuum into equal intervals? (Before an answer is given we
should note that he does not fail on metathetic continua.) On prothetic
continua it is as though the observer finds himself biased by the fact that 
a given difference at the low end of the scale is more noticeable and im- 
pressive than the same absolute difference at the high end of the scale. On 
the metathetic continua of pitch, apparent azimuth, etc., where this asym- 
metry is not present, the category scale is not systematically curved. 
Although the distinction between prothetic and metathetic rests on 
functional criteria [11], it is interesting that some of the better known
prothetic continua seem to be mediated by an additive mechanism at the
physiological level, whereas the metathetic continua appear to involve substitutive
processes at the physiological level. Thus we experience a change in loudness
when excitation is added to excitation already present on the basilar
membrane, but we note a change in pitch when new excitation is substituted for
excitation that has been removed, i.e., the pattern of excitation is displaced.


4. Easy on the observer 


There is no question that magnitude estimation sometimes puts more 
strain on the typical observer than does an exercise in paired comparisons. 
But is that relevant? At the end of an experiment some observers may ex- 
press surprise that magnitude estimation is not as difficult as they had 
supposed it would be. But that, too, is essentially irrelevant. 

Some people argue that scaling by the processing of variability is some- 
how more objective than trying by the direct scaling methods to assess 
subjective magnitude. If by objective is meant that all the observer needs 
to be is variable, then one may be inclined to agree. But it is hard to see 
how there is any greater objectivity to the scatter of an observer's decisions 
than there is to the average values. 


5. Model makers’ inspiration 


Although certain workers have made some of the ratio-scaling methods 
seem computationally complicated, others of us now scale subjective magni- 
tudes in ways that call only for the computation of a geometric mean, or 
sometimes a median. Those who like to devise more intricate models are 
probably inspired less by simple averaging than they are by the notion that 
a measure of dispersion can be used for something more than the measurement 
of dispersion. 

Thurstone, as we have seen, was the acknowledged leader in the develop- 
ment and exploitation of models designed to relate subjective magnitude to 
disagreements among data. His Cases I to V have formed the basis for a sys- 
tematic exploration in an interesting game called assumptions and conse- 
quences. The assumption that the subjective variability (discriminal disper- 
sion) is constant all up and down the continuum is the one most commonly 







very wide margin. Here the jnd grows subjectively larger as we go up the
scale. Whenever we are concerned with a continuum that behaves like
sensory intensity, the Thurstonian scaler would do better to assume that 
subjective dispersion is not constant but is proportional to subjective magni- 
tude. This assumption (call it Case VI) would be a vast improvement in 
many instances, but it has the unhappy defect that it is not always true. 
Under “constant” experimental conditions, variability, or error, tends in
all physical systems of measurement (including those involving living orga- 
nisms) to be roughly proportional to magnitude, but many factors can operate 
to upset this relation in particular instances. One cannot escape the con- 
clusion that it is very very risky to try to transform error into measurement. 
But if we insist on living dangerously, and if we find a circumstance where 
Case VI can be justified as a proper assumption, the resulting Thurstonian 
scale will turn out to be only an instance of what I have called a logarithmic 
interval scale [15]. It will not be a linear interval scale of the kind Thurstone 
was seeking to construct. 
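
A short derivation suggests why proportional dispersion leads to a logarithmic scale. The symbols are assumptions introduced here: $\psi$ is subjective magnitude, $\sigma_\psi$ its dispersion, $c$ the constant of proportionality, $u$ the scale obtained by treating each unit of dispersion as one scale unit, and $\psi = k\varphi^{n}$ the power function relating $\psi$ to stimulus magnitude $\varphi$.

```latex
\begin{align*}
\sigma_\psi &= c\,\psi
  && \text{(``Case VI'': dispersion proportional to magnitude)} \\
du &= \frac{d\psi}{\sigma_\psi} = \frac{d\psi}{c\,\psi}
  && \text{(one unit of dispersion counted as one scale unit)} \\
u &= \frac{1}{c}\,\ln\psi + \text{const}
  && \text{(the scale is linear in } \log\psi\text{, not in } \psi\text{)} \\
u &= \frac{n}{c}\,\ln\varphi + \text{const}
  && \text{(substituting } \psi = k\,\varphi^{n}\text{)}
\end{align*}
```

On these assumptions the dispersion-based scale is a logarithmic function of both the subjective and the physical magnitude, which agrees with the roughly logarithmic form reported for scales built from category confusions.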

Models have their uses, to be sure. I do not mean to deny it. The over-all 
aim of the scientific enterprise is the correct representation of the universe 
“on paper,” in the form of adequately isomorphic models. Even though
the end product of scientific inquiry is a mapping of nature into a model, 
scientists are not always agreed on how best to proceed. Fechner himself, 
except for his ingenious and erroneous inspiration regarding the measurement 
of sensation as he lay abed on October 22, 1850, was a man who went directly 
to nature for many of his ideas. At the other extreme we have scientists who 
never descend from the pencil-pushing plane. In principle, natural truth 
can be disclosed by both approaches, and there is no a priori rule to tell us 
which will succeed better at any particular juncture. We can only guess. 
In the immediate future psychological measurement may well profit more 
from astute laboratory observations than from ingenious model constructions. 
Ultimately, of course, we will need both. 

Much interest has recently been stirred up by the attempt to apply the 
model known as detection theory to the process by which a person tries to 
detect a signal immersed in a noise. This problem is the jnd problem in 
another guise. Detection theory confronts a narrower issue than that ad- 
dressed by Fechner and Thurstone, for the theory does not extend to psy- 
chological measurement, but only to a kind of statistical decision making. 
Whether a human observer can be compared to the ideal observer (a mathe- 
matical abstraction derived from statistical decision theory) is a matter of 
current dispute [4]. But regardless of how this particular phase of the argu- 
ment may turn out, there seems little question that detection theory provides 
an interesting approach to the particular problem of digging signals out of
noise. 

What happens, however, when the noise is turned off? If care is taken 
to get rid of as much of the noise as possible, both in the apparatus and in 
the observers, we have a chance to observe the all-or-none operation of the 
nervous system as it responds to different stimulus levels [21]. In the limit 
we detect the NQ (neural quantum), which shows up as a kind of step 
function. The psychometric function relating the frequency of detection to 
the size of the stimulus increment then becomes linear and takes on the 
precise slope predicted by the simple all-or-none model [17]. 

There seems to be an impression that detection theory and the NQ 
theory are necessarily at odds. This impression is reinforced by Green [2] 
who seems to brush the NQ theory aside by arguing, mistakenly I think, 
that the predicted slope of the psychometric function will fail to be realized 
if the sound stimulus is measured in energy units rather than in pressure 
units. Calculation shows that it makes only a negligible difference which 
units are used; but, quite aside from that, one ought to try to make clear 
that both models may be essentially correct: detection theory when the noise
is controlling, the NQ theory when the noise is suppressed to the point at
which the inevitable discontinuities in neural functioning can begin to show
through.


6. Pseudo-differential equations 


Sometimes implicitly, sometimes explicitly, various authors have taken
the conventional manner of indicating the jnd, namely, $\Delta I$, as evidence that
we are dealing here with the makings of a differential equation. This is
unfortunate, because $\Delta I$ is not a difference of the kind appropriate to a
differential equation. The measured $\Delta I$ is nothing more nor less than an index
of dispersion, like a standard deviation or a quartile deviation. Since one of
the common measures of the jnd is the 75th percentile point on a cumulative
ogive, we probably ought to take this quartile point, Q, as the representative
symbol of dispersion and insist on writing Weber's law as $Q = kI$.

It may be intuitively attractive to suppose that when a psychophysical 
function (e.g., the sone scale of loudness) gets steeper, the jnd (or the Q) 
must get smaller. The opposite is in fact the case, at least when we change 
from a frequency of, say, 1000 cps to 50 cps [5]. The loudness function is
steeper at 50 cps, but the jnd is larger there. Other instances can be found in 
which the psychophysical function stays put but the jnd changes. Thus 
the brightness function shows rather small change with target area, but the 
published jnd values grow by a factor of about 50 when the area changes
from a small to a large field [3]. Or, to take a different example, the sensation 
of electric shock grows rapidly with intensity (exponent = 3.5), but the 
resolving power on this continuum seems to be little if at all better than in 
the vibration sense, where the exponent is several times smaller (see Table 1).

The fact of the matter is that the relation between the jnd (the Q of the 
error distribution in an experiment on resolving power) and the slope of the 
subjective magnitude function is no more than a rough correlation. We can 
change either one without changing the other. Stated another way, we can 
change the quartile deviation without changing the median, and vice versa.


The Psychophysical Law 


So much for the reasons that can be offered to account for the anomalous 
persistence of the romantic notion that out of error and confusion we can 
forge the units of psychological measurement. What have we to offer in its 
place? Perhaps the best answer has already been given in Table 1, which 
contains some of the results obtained by direct ratio scaling. We have as yet 
encountered no exception to what I have had the temerity to call “the
psychophysical law”: the rule that psychological magnitude $\psi$ is related to the
physical stimulus $\phi$ by

$$\psi = k(\phi - \phi_0)^n,$$

where $\phi_0$ is the effective threshold. But it is one thing to accumulate a body
of results by the repeated application of a procedure and another thing to 
prove that the results are good for something. How valid are the values 
in Table 1, and what are they good for? 

Other evidences can be cited [18], but in some ways the most dramatic 
validation of the scales generated by asking observers to make numerical 
estimations of sensory intensity is the demonstration that these same scales 
can be generated even when no appeal is made to ‘number behavior’ at all. 
By means of cross-modality comparisons, each subjective continuum can 
be related to each other continuum, and the family of power functions 
governing the various sensory continua can all be assigned their appropriate 
exponents relative to that of some base continuum, such as apparent length 
of lines, for example. In practice, however, we have been content to go along 
with the results of the several procedures involving numerical judgments, 
because the findings have stood the test of cross-modality validation. The 
argument runs as follows. 

If, given an appropriate choice of units, two modalities are governed
by the equations

$$\psi_1 = \phi_1^m \quad \text{and} \quad \psi_2 = \phi_2^n,$$

and if the subjective values $\psi_1$ and $\psi_2$ are equated by asking the observer
to make the one sensation seem as strong as the other sensation at various
stimulus levels, then the resulting equal-sensation function will be given by

$$\phi_1^m = \phi_2^n.$$

In terms of logarithms,

$$\log \phi_1 = \frac{n}{m} \log \phi_2.$$


In log-log coordinates, therefore, the equal-sensation function should be a 
straight line with a slope equal to the ratio of the two exponents. 
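
The straight-line prediction can be checked numerically. The following Python sketch is illustrative only: it writes m and n for the exponents of the two power functions, and the particular exponent values, stimulus levels, and function names are assumptions chosen for the example, not values from the experiments.

```python
import math

def matched_stimulus(phi2, m, n):
    """Stimulus phi1 on continuum 1 whose sensation phi1**m equals phi2**n."""
    return phi2 ** (n / m)

def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) plotted against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

# Hypothetical exponents for the two continua (assumed for illustration).
m, n = 2.0, 3.0

# Hypothetical stimulus levels on continuum 2, and the noise-free matches
# an observer obeying both power laws would set on continuum 1.
phi2_levels = [1.0, 2.0, 4.0, 8.0, 16.0]
phi1_matches = [matched_stimulus(p, m, n) for p in phi2_levels]

slope = loglog_slope(phi2_levels, phi1_matches)
print("fitted slope:", round(slope, 3), "ratio n/m:", round(n / m, 3))
```

With real matching data the points would scatter about the line, and the fitted slope would be an estimate of the ratio of the exponents rather than an exact reproduction of it.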

This prediction was nicely borne out by a series of cross-modality matches
between all possible pairs of the three continua: loudness in the two ears, 
vibration on the finger tip, and electric shock to the fingers [14]. From this 
encouraging beginning, the procedure of cross-modality matching has been 
extended to numerous other pairs, with especial emphasis on what might 
be called scaling by squeezing. 

Using a precision dynamometer, J. C. Stevens and Mack [6] worked 
out the subjective scale relating the subjective force of handgrip to the 
physical force exerted by the subject. This relation turned out to be a power 
function with an exponent of 1.7. Equipped with this scale, we then proceeded 
to take the measure of other sensory continua by asking observers to squeeze 
the dynamometer until the sensation of strain matched the apparent in- 
tensity of a criterion sensation in some other modality [7, 8]. The results 
for each of nine different continua gave equal-sensation functions that 
approximated straight lines in log-log coordinates, as demanded by the 
power law. More interesting, perhaps, is the exact numerical relation between 
the slopes determined by matching with handgrip and those determined by 
matching with numbers (i.e., magnitude estimation). Since the exponent for 
apparent force of handgrip is 1.7, we expect that, if we divide the appropriate 
slopes (exponents) in Table 1 by the factor 1.7, we will obtain the slopes of 
the equal-sensation functions determined by matching with handgrip. How 
well this expectation is fulfilled is shown by the comparisons in Table 2. 
Despite the inevitable variability that plagues our attempt to study the 
input-output operating characteristics of the various sensory systems, we 
can apparently complete an interesting circle of validation with rather 
impressive precision. (For further details, see [16].) 
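
As a worked instance of this division, the short Python sketch below uses only two figures quoted in the text: the handgrip exponent of 1.7 and the exponent of 3.5 given earlier for electric shock. The function name is a hypothetical one introduced for the illustration.

```python
HANDGRIP_EXPONENT = 1.7  # exponent for apparent force of handgrip, from the text

def predicted_handgrip_slope(ratio_scale_exponent,
                             handgrip_exponent=HANDGRIP_EXPONENT):
    """Predicted slope of the equal-sensation function obtained by squeezing."""
    return ratio_scale_exponent / handgrip_exponent

# Electric shock: ratio-scale exponent of 3.5, as quoted in the text.
shock_slope = predicted_handgrip_slope(3.5)
print(f"{shock_slope:.2f}")  # prints 2.06
```

The result agrees with the predicted value of 2.06 listed for electric shock in Table 2.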

TABLE 2

The Exponents (Slopes) of Equal-Sensation Functions, as Predicted from Ratio Scales

                               Ratio Scale                                          Scaling by Means of Handgrip
                               Exponent of                                          Predicted      Obtained
Continuum                      Power Function   Stimulus Range                      Exponent       Exponent

Electric shock                 3.5              0.29-0.73 milliamp                  2.06           [illegible]
  (60-cycle current)
Temperature (warm)             [illegible]      2.0-14.5°C. above                   0.94           [illegible]
                                                  neutral temperature
Heaviness of lifted weights    [illegible]      28-480 grams                        0.85           [illegible]
Pressure on palm               [illegible]      0.5-5.0 pounds                      0.65           [illegible]
Temperature (cold)             [illegible]      3.3-30.6°C. below                   [illegible]    [illegible]
                                                  neutral temperature
60-cycle vibration             [illegible]      17-47 db re approximate             [illegible]    [illegible]
                                                  threshold
Loudness of white noise        [illegible]      55-95 db re 0.0002 dyne/cm²         [illegible]    [illegible]
Loudness of 1000-cycle tone    [illegible]      47-87 db re 0.0002 dyne/cm²         [illegible]    [illegible]
Brightness of white light      [illegible]      56-96 db re 10^-10 lambert          [illegible]    [illegible]

Let us turn finally to the domain of practical applications, an area
that is not always without interest to the academic mind. As many readers
will already know, the sone scale of loudness, the first and most thoroughly
studied of the modern ratio scales of sensation, has long since proved its
usefulness to the acoustical engineer. As a matter of fact, it was a commercial
company that first made a serious attempt to erect a subjective ratio scale
back in the 1930’s. The loudness scale, whose unit is the sone [9], recently
performed its bit as an essential link in the development of a method for
computing the total binaural loudness of a complex sound spectrum, given
an analysis of the sound in terms of octave or third-octave bands [10]. The
loudness in sones of each separate band of the noise is determined from a
set of equal-loudness contours, and then the loudness values are weighted
and added up according to a simple rule that was empirically determined.
To the value in sones of the loudest octave band is added 0.3 times the sum 
of the sone values of the remaining bands. If the physical analysis of the 
sound happens to be made in third-octave bands, the rule remains the same, 
except that the factor 0.3 becomes 0.15. This method of loudness addition 
across the frequency domain reflects the interactions among the bands: each
band of noise adds to the total loudness, but neighboring bands also inhibit 
or mask one another. Hence there is no straight, unweighted addition of 
loudness as bands are added to the spectrum. A version of this calculation 
procedure [12] is in fairly widespread use, and it is being readied as a Secre- 
tariat Proposal for general adoption by the International Standards Organi- 
zation. 
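
The summation rule just described is simple enough to state as a few lines of code. The Python sketch below is a minimal illustration: the factors 0.3 and 0.15 come from the text, while the function name, the band labels, and the example band loudnesses are hypothetical. It assumes the loudness of each band in sones has already been read from the equal-loudness contours.

```python
# Weighting factor applied to the bands other than the loudest one.
BAND_FACTOR = {"octave": 0.3, "third-octave": 0.15}

def total_loudness(band_sones, band_type="octave"):
    """Total loudness in sones: loudest band plus a fraction of the rest."""
    if not band_sones:
        raise ValueError("at least one band loudness is required")
    factor = BAND_FACTOR[band_type]
    loudest = max(band_sones)
    remainder = sum(band_sones) - loudest  # sum of the remaining bands
    return loudest + factor * remainder

# Hypothetical band loudnesses in sones (not measured data).
bands = [10.0, 4.0, 2.0]
print(round(total_loudness(bands, "octave"), 2))        # 10 + 0.3 * 6
print(round(total_loudness(bands, "third-octave"), 2))  # 10 + 0.15 * 6
```

Because the factor is less than 1, the total is always smaller than the plain sum of the band loudnesses, which is the mutual inhibition the text refers to.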

The relevance of all this to the interests of the Fechner symposium may 
not be very great, but it shows that ratio scales of sensation have their 
utility in the world of practical decisions. These scales have greater demon- 
strable usefulness, I believe, than any Fechnerian scale yet devised. 
DETECTION THEORY AND PSYCHOPHYSICS: A REVIEW*


JOHN A. SWETS
MASSACHUSETTS INSTITUTE OF TECHNOLOGY


I am particularly pleased to discuss the application in psychophysics 
of the general theory of signal detection at a symposium commemorating 
Fechner’s founding work. Although this effort is in part a theoretical and 
experimental critique of Fechner’s principal concepts and methods, which 
indicates that they should be replaced, I suspect that he would have welcomed 
it warmly—in fact, that he would have been among the first to recognize 
the value of these new tools had they become available in his time. I suspect 
this on the basis of his interest in Bernoulli’s early ideas about statistical 
decision ([3], p. 284), and in the notion of subthreshold, or “negative,” sensations
([3], p. 293), two central concepts in the psychophysical application
of the theory of signal detectability. 

The theory of signal detectability (henceforth called TSD) was developed 
most fully in the years 1952-1954 by Peterson, Birdsall, and Fox [35] at the 
University of Michigan, and by Van Meter and Middleton [55] at the Massa- 
chusetts Institute of Technology. At the same time, although working apart 
from TSD, Smith and Wilson [39] at the Massachusetts Institute of Tech- 
nology and Munson and Karlin [33] at the Bell Telephone Laboratories were 
conducting psychoacoustic experiments that demonstrated the relevance of 
the theory to human observers; their experimental results led them to suggest 
a similar theory of the human detection process. Meanwhile, Tanner and
Swets [52, 53] were making a formal application of TSD in the field of vision. 

Since then, other general discussions and reviews similar to this one
have appeared; I should mention the 1955 paper by Swets, Tanner, and 
Birdsall [47] that includes a complete review of the data collected in vision; 
the 1956 paper by Tanner, Swets, and Green [54] that includes several 
studies in audition; Green’s [22] exposition in the current series of tutorial 
articles in the Journal of the Acoustical Society; and Licklider’s chapter [30] 
in the series edited by Koch for the American Psychological Association. 
The present discussion is distinguished from the first three of these in that 
it is not tutorial, detailed, nor documentary. The statistical and psycho- 
physical bases of the work are not considered here, and no data are presented. 
In this sense, the present discussion is most like Licklider’s. Its only advantage
is that of time: it comes three years later and is based on three
times as many papers. 

It will be helpful to emphasize at the outset (as Green [22] has suggested) 
that TSD is a combination of two distinct theoretical structures: (i) statistical 
decision theory, or the theory of statistical inference, as developed principally 
by Wald [59] who built upon the earlier work of Neyman and Pearson [34], 
and (ii) the theory of ideal observers, initiated by Siegert [see 29]. In TSD, 
statistical decision theory is used to treat the detection task as a decision 
process, specifically, as an instance of testing statistical hypotheses. This 
enables TSD to deal effectively with the long-standing problem in psycho- 
physics of the control and measurement of the criterion for signal existence 
that is employed by the observer. The theory of ideal observers makes it 
possible to relate the level of the detection performance attained by a real 
observer to the mathematically ideal detection performance. The mathe- 
matical ideal is the upper limit on the detection performance that is imposed 
by the environment. This limit is stated in terms of measurable parameters 
of the signal and of the masking noise for a variety of types of signal and 
noise. It is often instructive, as we shall see, to consider the nature of the 
discrepancy between observed and ideal performance. 

We have used TSD as a framework for the experimental study of sensory 
systems. The framework role suggests itself for the theory, because the 
theory specifies the nature of the detection process, and it defines the experi- 
mental methods that are appropriate, given its conception of the detection 
process. TSD is also, to a limited extent, but to a far larger extent than is 
generally recognized, a “substantive” theory of vision and audition. We
have examined the correspondence between the human observer’s detection 
process and the process described by the theory. The methods that the 
theory prescribes have been compared with others available. We have ex- 
amined some substantive implications of the theory and have applied the 
theory and methods to other substantive problems. 

I shall discuss, in turn, how the decision-theory aspects and the ideal- 
observer concepts of TSD have been applied to human behavior, with an 
emphasis on theory and experimental method. Unfortunately, the time
available will permit only a slight admixture of substantive results. I shall 
concentrate on the simplest detection task, mentioning only briefly extensions 
of the theory and experimental procedures to more complex perceptual tasks. 


Decision Aspects of Signal Detection 


As I have noted, the decision-theory part of TSD impressed us strongly 
at our first acquaintance because it gave promise of dealing with a difficult 
problem in psychophysics—namely, the determination of the dichotomy 
between the observer’s positive and negative reports, between stimuli he 
reports he does and does not see or hear, etc. To state the problem otherwise, 
it is the definition of the observer’s criterion for making a positive response.


Let us review briefly just how TSD deals with this problem. 


The Fundamental Detection Problem 
and the Concept of the Likelihood Ratio 


I shall first define a particular detection problem, a very simple one— 
in terms of TSD, the fundamental detection problem. The observer is in- 
structed to attend to a certain class of physical events (perhaps visual or 
auditory) that the experimenter generates during a specified interval of 
time, and to make a report following the interval about these events. Coinci- 
dent with the specified observation interval, presumably, is some (neural) 
activity of the relevant sensory system. This activity forms the sensory 
basis, a part of the total basis, for the observer’s report. This sensory response, 
as we shall call it for the moment (meanwhile noting that the sensory activity 
coincident with the specified temporal interval need not be entirely a response 
to the physical events produced by the experimenter), may be in fact either 
simple or complex, it may have many dimensions or few, it may be qualitative 
or quantitative, it may be anything—the exact, or even the general, nature 
of the actual sensory response is of no concern to the application of the theory. 

Only two assumptions are made about the sensory response. One is 
that the sensory response that occurs in the presence of a given signal is
variable. In particular, the response is perturbed by random interference or 
“noise”—noise produced inadvertently by the experimenter’s equipment for 
generating stimuli, or deliberately introduced by the experimenter, or inherent 
in the sensory system. It is assumed that some noise, whatever its origin, is
always present. Thus the sensory response will vary over time in the absence
of any signal, as well as vary from one presentation to the next of what is 
ostensibly the same signal. In the fundamental detection problem, the 
observation interval contains either the noise alone, or a specified signal and 
the noise. The observer's report is limited to these two classes of stimulus 
events: he says “Yes” (the interval contained the specified signal) or “No”
(the interval did not contain the specified signal, i.e., it contained noise 
alone). Note that he does not say whether or not he heard (or saw) the 
signal—but whether or not, under the circumstances, he prefers the decision 
that it was present to the decision that it was absent. 

The second assumption made about the sensory response is that, what- 
ever it may be in fact, it may be represented (insofar as it affects the observer’s 
report) as a unidimensional variable. In particular, the observer is assumed to 
be aware of the probability that each possible sensory response will occur 
during an observation interval containing noise alone, and also during an 
observation interval containing a signal in addition to the noise. He is assumed 
to base his report on the ratio of these two quantities, the likelihood ratio. 
The likelihood ratio derived from any observation interval is a real, nonzero
number and may thus be represented along a single dimension. 


The Likelihood Ratio Criterion 


According to TSD, following statistical decision theory, the observer's 
report following an observation interval will depend upon whether or not 
the likelihood ratio measured in that interval exceeds some critical value 
of the likelihood ratio, some criterion. The criterion (or decision rule) is 
presumed to be established by the observer in accordance with his detection 
goal and the relevant situational parameters. For example, if his goal is to 
maximize the expected value of his decisions, i.e., to maximize his total 
payoff over several trials in which each of the four possible decision outcomes 
has a value associated with it, then his criterion will depend on these values 
and on the a priori probability that a signal will occur on a given trial. 
Statistical decision theory prescribes the optimal criterion for any values 
assumed by this set of parameters. (We shall discuss shortly the relationship 
between the optimal criterion and the criterion used by the human observer.) 
The observer may try to achieve any of a number of other detection goals, 
and there will be, in general, a different optimal criterion corresponding to 
each goal. The criterion under each of these goals, i.e., under each definition 
of optimum, can be expressed in terms of the likelihood ratio [35, 59]. (It
can be shown that the optimal criterion is defined equally well on any monotonic
function of the likelihood ratio; the number corresponding to the
criterion will differ, but the decisions will be the same.)
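
A small numerical sketch may make the decision rule concrete. The Python code below rests on two assumptions that are not stated in the text at this point: that the noise-alone and signal-plus-noise distributions are normal with equal variance, and that the expected-value criterion takes the form usually given in statistical decision theory, namely the ratio of the a priori probabilities multiplied by the ratio of the payoff sums. All function and parameter names are hypothetical, and costs are entered as positive magnitudes.

```python
import math
from statistics import NormalDist

def likelihood_ratio(x, d_prime):
    """Likelihood ratio at observation x for unit-variance normal distributions:
    noise alone centred at 0, signal plus noise centred at d_prime (assumed)."""
    noise = NormalDist(0.0, 1.0)
    signal = NormalDist(d_prime, 1.0)
    return signal.pdf(x) / noise.pdf(x)

def optimal_beta(p_signal, v_hit, c_miss, v_correct_rejection, c_false_alarm):
    """Expected-value criterion on the likelihood ratio (standard form, assumed)."""
    p_noise = 1.0 - p_signal
    return (p_noise / p_signal) * (v_correct_rejection + c_false_alarm) / (v_hit + c_miss)

def say_yes(x, d_prime, beta):
    """Report 'Yes' when the likelihood ratio exceeds the criterion."""
    return likelihood_ratio(x, d_prime) > beta

def say_yes_log(x, d_prime, beta):
    """The same rule stated on a monotonic transform (the logarithm)."""
    return math.log(likelihood_ratio(x, d_prime)) > math.log(beta)

# Hypothetical situation: signals on one trial in four, symmetric payoffs.
beta = optimal_beta(p_signal=0.25, v_hit=1, c_miss=1,
                    v_correct_rejection=1, c_false_alarm=1)
for x in (0.0, 1.0, 2.0, 3.0):
    print(x, say_yes(x, 1.0, beta), say_yes_log(x, 1.0, beta))
```

The two rules return the same decisions for every observation, although the criterion is a different number on each scale, which is the point of the parenthetical remark above.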

We next consider a probability defined on the variable likelihood ratio, 
in particular, the probability that each value of the likelihood ratio will 
occur with each of the classes of possible stimulus events, noise alone, and 
signal plus noise. We have, therefore, two probability distributions. The 
one associated with signal plus noise will have a greater mean—indeed, its 
mean is assumed to increase monotonically with increases in the signal 
strength. If the observer follows the procedure described (i.e., if he reports 
that the signal that was specified is present whenever the likelihood ratio 
exceeds a certain criterion, and that noise alone is present whenever it is 
less than this criterion) then, from the four-fold stimulus-response matrix 
that results, we can extract two independent, quantitative measures—a 
measure of the observer’s criterion and a measure of his sensitivity. 


The Operating Characteristic 


The important concept here is the operating characteristic (OC). Suppose 
we induce the observer to change his criterion from one set of trials to another. 
Suppose for each criterion we plot the proportion of Yes reports made when 
the signal was present (the proportion of correct detections, or hits) versus 
the proportion of Yes reports made when noise alone was present (the proportion
of false alarms). Then as the criterion varies, we trace a single curve
(running from 0 to 1.0 on both coordinates) showing the proportion of hits 
to be a nondecreasing function of the proportion of false alarms. This curve 
describes completely the successive stimulus-response matrices that are ob- 
tained, since the complements of these two proportions are the proportions 
that belong in the other two cells of the matrix. The particular curve generated 
in this fashion depends upon the signal and noise parameters and upon 
the observer’s sensitivity. The point on this curve that corresponds to any 
given stimulus-response matrix represents the criterion employed by the 
observer in generating that matrix. 

We have found that, to a good approximation, the OC curves produced 
by human observers correspond to theoretical OC curves based on normal 
probability distributions [47, 54]. (This is an empirical fact; it is not necessary 
to make in advance any particular assumptions about the form of the proba- 
bility distributions in order to use the OC analysis.) These OC curves have 
the convenient property of being characterized by a single parameter: the 
difference between the means of the signal-plus-noise and noise-alone distri- 
butions divided by the standard deviation of the noise distribution. This 
parameter has been called d’. Further, the slope of the curve at any point 
is equal to the value of the likelihood ratio criterion that produces that point. 
Thus, to repeat, from any stimulus-response matrix, one can obtain two 
independent measures: one an index of the sensitivity of the observer to 
the particular signal and noise used, the other an index of his criterion. 
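
The extraction of the two measures from a single stimulus-response matrix can be sketched as follows. This Python example assumes normal distributions of equal variance, as in the theoretical OC curves mentioned above; the function names and the cell counts are hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution

def rates(hits, misses, false_alarms, correct_rejections):
    """Hit and false-alarm proportions from the four cells of the matrix."""
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    return hit_rate, fa_rate

def d_prime(hit_rate, fa_rate):
    """Sensitivity: separation of the means in units of the noise
    standard deviation (equal-variance normal assumption)."""
    return _STD.inv_cdf(hit_rate) - _STD.inv_cdf(fa_rate)

def criterion_beta(hit_rate, fa_rate):
    """Likelihood-ratio criterion, equal to the slope of the OC curve
    at the point (fa_rate, hit_rate) under the same assumption."""
    return _STD.pdf(_STD.inv_cdf(hit_rate)) / _STD.pdf(_STD.inv_cdf(fa_rate))

# Hypothetical counts from 100 signal trials and 100 noise-alone trials.
h, f = rates(hits=84, misses=16, false_alarms=16, correct_rejections=84)
print("d' =", round(d_prime(h, f), 2))
print("beta =", round(criterion_beta(h, f), 2))
```

Proportions of exactly 0 or 1 cannot be converted by `inv_cdf`, which raises an error for them; this sketch makes no attempt to handle such cases.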


The Experimental Invariance of d’ 


It has been shown experimentally, in both vision [47] and audition [54], 
that the measure d’ remains relatively constant with changes in the criterion. 
Thus, TSD provides a measure of sensitivity that is practically uncontami- 
nated by attitudinal or motivational variables, i.e., by variables that might 
be expected to affect the observer's criterion. 

It has also been shown that the measure d’ remains relatively invariant over 
different experimental procedures. Specifically, we have compared estimates 
of d’ obtained from the Yes-No procedure (i.e., the fundamental detection 
problem) with estimates of d’ obtained from a forced-choice procedure. 
Under forced-choice, n temporal intervals were defined on each trial, exactly 
one of which contained the signal; the observer selected one of the n intervals. 
The probability of a correct response as a function of d’ can be calculated 
by using the empirical OC curve or by making some assumptions about 
the form of the probability distributions. We have calculated this probability 
as a function of d’, under the assumptions that the probability distributions 
are normal and of equal variance. The relations between the probability of 
a correct response and d’, for several different numbers of alternatives in a 
forced-choice trial, have been tabulated [15]. 
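
For readers who wish to reproduce such a relation, the following Python sketch computes the probability of a correct response in an n-interval forced-choice trial by numerical integration. It uses the integral commonly written down under the stated assumptions of normal, equal-variance distributions: the observation from the signal interval must exceed those from all n - 1 noise intervals. Whether this is exactly the form used for the published tables is an assumption, and the function name and step count are hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution

def p_correct_forced_choice(d_prime, n_intervals, steps=4000):
    """P(correct) = integral of pdf(x - d') * cdf(x)**(n - 1) over x,
    evaluated with the trapezoidal rule on a wide finite range."""
    lo, hi = -8.0, d_prime + 8.0
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * width
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * _STD.pdf(x - d_prime) * _STD.cdf(x) ** (n_intervals - 1)
    return total * width

# Probability correct for a fixed sensitivity and several numbers of intervals.
for n in (2, 3, 4, 6, 8):
    print(n, round(p_correct_forced_choice(1.0, n), 3))
```

Two checks on the sketch: with a d’ of zero the result is the chance level 1/n, and with two intervals it agrees with the closed form, the standard normal distribution function evaluated at d’ divided by the square root of 2.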
For both vision [52] and audition [41] the estimates of d’ from the Yes-No
procedure and from the four-interval forced-choice procedure are very nearly 
the same. We have further found remarkably consistent estimates of d’ 
from forced-choice procedures with 2, 3, 4, 6, and 8 intervals [41]. Also, the 
rating procedure in which the observer chooses one of several categories of 
certainty about the presence of a signal in a single interval yields OC curves 
(or estimates of d’) in both vision [42] and audition [14], that are indis- 
tinguishable from those obtained with the Yes-No procedure. These results 
are particularly fortunate since, as Licklider has pointed out, they represent
“a break in the trend ... to regard the results of a psychophysical test as
meaningless except in relation to the procedure of the test” ([30], p. 73).
It is evidently possible to have one rather than many psychophysical sciences. 


The Observer's Criterion and the Optimal Criterion 


I have mentioned earlier that we would return to a brief discussion of 
the relationship of the observer’s criterion to the criterion specified as optimal 
by statistical decision theory. Let us note, first, that TSD can be a useful 
tool in analyzing psychophysical data if it is merely the case that the ob- 
server controls a criterion (in terms of the likelihood ratio or some monotonic 
function of the likelihood ratio), whether or not his criterion bears any 
relationship to the optimal one. We have worked with the expected-value 
definition of the optimal criterion (in which the optimal criterion is a function 
of the a priori probability of signal occurrence and the values associated with 
the decision outcomes) simply because the manipulation of these parameters 
is a convenient and effective way of inducing the observer to change his 
criterion in a psychophysical experiment. This manipulation enables us to 
trace an empirical OC curve. Strictly speaking, whether or not the observer 
adjusts his criterion in terms of these variables in everyday life is of no 
concern in the laboratory. Neither are we concerned with subjective proba- 
bilities or the linearity of the utility of money. We are, of course, interested 
in the result we observed: the observer's successive criteria in our experiments 
were highly correlated with the optimal expected-value criteria, that is, the 
human observer is capable of approximating the optimal criterion. Even 
though our experimental situation is artificial in this respect, the result 
teaches us something about the observer’s capabilities. 

I should quickly add that it is difficult to compare the observer's criterion 
and the optimal criterion in more than a correlational sense. The reason for 
this is that the function relating the expected value to the criterion is very 
flat: any criterion within a very wide range (a range of as much as 0.40 in 
false-alarm rate) will result in a payoff that is at least 90 percent of the 
maximum payoff possible. Nonetheless, the fact that our observers respond
to a change in a priori probabilities or decision values by setting a new
criterion (indeed, the fact that they can adopt successively and repeatedly as many
as five criteria showing a perfect rank-order correlation with the optimal
criteria) has made it possible to achieve in our experiments what Graham
has called “a quantification of the instruction stimuli” ([18], p. 868). Egan
(who has been successful in controlling the criterion with such verbal instructions
as “lax,” “moderate,” and “strict” [see 14]) is now preparing to examine,
in more detail than was done previously, the relationship between the ob- 
server's criterion and various definitions of the optimal criterion, i.e., the 
relationship between quantitative instruction stimuli and quantitative esti- 
mates of the resulting criterion [10]. 


Implications for Psychophysical Methods 


I would now like to return to the proposition that TSD provides a frame- 
work for the study of sensory systems, in particular, that it specifies the 
experimental procedures that may be properly used given its conception of 
the nature of the detection process. 

Perhaps the most salient procedural implications of TSD are those 
concerned with the use of catch trials (trials containing noise alone) and the 
treatment of positive (Yes) responses made on those trials. As I have re- 
marked, when we first became acquainted with TSD, it was apparent that 
the theory would be valuable in psychophysics because it spoke forth elo- 
quently on exactly those conceptual and procedural problems that many 
of us believed to be handled inadequately in classical psychophysics. Licklider 
has expressed our discomfiture very aptly: “More and more, workers in the 
field are growing dissatisfied with the classical psychophysical techniques, 
particularly with the method of ‘adjustment’ or ‘production’ that lets the 
listener attend to the stimulus for an unspecified length of time before deciding 
that he can ‘just hear it’ and with the methods of ‘limits’ and ‘constants’ 
(in their usual forms) that ask the listener to report ‘present’ or ‘absent’ 
when he already knows ‘present.’ It is widely felt that the ‘thresholds’ yielded 
by these procedures are on such an insecure semantic basis that they cannot 
serve as good building blocks for a quantitative science” ([30], p. 75). 

It is simply not adequate to employ a few catch trials, enough to monitor 
the observer, and then to remind him to avoid false-positive responses each 
time one is made. This procedure merely serves to drive the criterion up 
to a point where it cannot be measured, and it can be shown that the calcu- 
lated threshold varies by as much as 6 db as the criterion varies in this un- 
measurable range. Precision is also sacrificed when highly trained observers 
are employed along with the untestable assumption that they do maintain 
a constant high criterion. Even if all laboratories should be fortunate enough 
to have such observers, the figure of 6 db is a reasonable estimate of the 
range of variation among “constant criterion” observers in different 
laboratories. To be sure, for some problems, this amount of variability is not 
bothersome; for others, however, it is. 

Experiments have also shown that the use of enough catch trials to 
provide a good estimate of the probability of a false-positive response will 
again leave one far short of the precision attainable, if this estimate is 
used to correct the data on signal trials for spurious positive responses on the 
grounds that these responses are independent of sensory-determined positive 
responses. Suffice it to say here that several experimenters, in a number of 
different experiments in different laboratories, have failed to find any evidence 
for the independent existence of a sensory threshold, a threshold that is 
independent of the probability of a false-positive report. We find response 
thresholds aplenty—any criterion used by the observer is such—but no 
measurable sensory thresholds. Of course, the validity of the various classical 
procedures depends directly on the validity of the concept of a sensory 
threshold. 

In view of a large collection of data, it seems to me to be only reasonable 
(1) to add enough noise in psychophysical experiments (a background of 
some kind) to bring the noise to a level where it can be measured, (2) to 
manipulate the response criterion so that it lies in a range where it can be 
measured, (3) to include enough catch trials to obtain a good estimate of 
this response criterion, and (4) to use a method of analysis that yields a 
measure of sensitivity that is independent of the response criterion. This 
prescription will stand, I believe, until such time that it becomes possible to 
demonstrate that all traces of noise can be eliminated from a sensory 
experiment. The only other qualifying remark I would make is a positive one: 
we can forego estimating the response criterion in a forced-choice experiment. 
Experience has shown that we can reasonably view the observer as choosing 
the interval having the greatest “sensory response” associated with it, without 
regard to any criterion. For this reason, the forced-choice procedure is a 
highly desirable procedure for use in purely sensory studies. 
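
As an illustration of step (4), a criterion-free measure of sensitivity can be computed from the proportions of Yes responses on signal trials and on catch trials. The sketch below assumes the equal-variance normal model and, for forced choice, two observation intervals; these assumptions and the function names are introduced here for illustration and are not a statement of the analysis used in the experiments cited.

```python
from statistics import NormalDist

# Inverse of the standard normal distribution function.
# It raises an error for proportions of exactly 0 or 1.
_z = NormalDist().inv_cdf


def d_prime_yes_no(hit_rate, false_alarm_rate):
    """Sensitivity from a Yes-No experiment (equal-variance normal model)."""
    return _z(hit_rate) - _z(false_alarm_rate)


def criterion_yes_no(hit_rate, false_alarm_rate):
    """Criterion location, measured from the midpoint between the noise
    and signal means; positive values indicate a strict criterion."""
    return -0.5 * (_z(hit_rate) + _z(false_alarm_rate))


def d_prime_forced_choice(p_correct):
    """Sensitivity from a two-interval forced-choice experiment."""
    return 2 ** 0.5 * _z(p_correct)


# Two observers with different criteria but the same sensitivity.
print(d_prime_yes_no(0.8413, 0.5000), criterion_yes_no(0.8413, 0.5000))
print(d_prime_yes_no(0.5000, 0.1587), criterion_yes_no(0.5000, 0.1587))
```

The two example observers differ widely in their proportions of Yes responses, yet the computed sensitivity is essentially the same for both; only the criterion estimate differs.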


The Theory of Ideal Observers 


The portion of TSD that pertains to the optimal or ideal performance 
in the sense of detectability, or sensitivity, rather than the decision criterion, 
is known as the theory of ideal observers. This theory gives, for several types 
of signal and noise, the maximum possible detectability as a function of the 
parameters of the signal and noise [35]. Under certain assumptions, this 
relationship can be stated precisely. We have regarded the case of the “signal 
specified exactly” (in which everything about the signal is known, including 
its frequency, phase, starting time, duration, and amplitude) as a useful 
standard in psychoacoustic experiments. In this case, the maximum $d'$ 
is equal to $\sqrt{2E/N_0}$, in which $E$ is the signal energy, or time integral of power, 
and $N_0$ is the noise power in a one-cycle band. Recently, an ideal observer 
applicable to visual signals has also been developed [51]. 

We believe that a theory of ideal performance is a good starting point 
in working toward a descriptive theory. Ideal theories involve few variables, 
and these are simply described. Experiments can be used to uncover whatever 
additional variables may be needed to describe the performance of real 
observers. Alternatively, experiments can be used to indicate how the ideal 
theory can be degraded, i.e., to identify those functions of which the ideal 
detection device must be deprived, in order to describe accurately real 
behavior. Measures can be defined to describe the real observer’s efficiency; 
Tanner and Birdsall [50] have, for example, described the efficiency measure 
$\eta$ as used in the expression 


$$d' = \eta \sqrt{2E/N_0}$$


in which the value of $d'$ is that observed experimentally. They have suggested 
that substantive problems may be illuminated by the computation of $\eta$ 
for different types of signals and for different parameters of a given type of 
signal. The observed variation of this measure should be helpful in deter- 
mining the range over which the human observer can adjust the parameters 
of his sensory system to match different signal parameters. (He is, after 
all, quite proficient in detecting a surprisingly large number of different 
signals.) This variation should also be helpful in determining which parame- 
ters of a signal the observer is not using, or not using precisely, in his detection 
process. 
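
A small numerical sketch may make the efficiency measure concrete. It assumes that the expression above reads d' = η√(2E/N₀), so that η is the ratio of the observed d' to the ideal d' for the signal specified exactly. The function names and the example values are hypothetical.

```python
import math


def ideal_d_prime(signal_energy, noise_power_density):
    """Maximum d' for the signal specified exactly: sqrt(2E/N0)."""
    return math.sqrt(2.0 * signal_energy / noise_power_density)


def efficiency(observed_d_prime, signal_energy, noise_power_density):
    """Efficiency eta, assuming d'_observed = eta * sqrt(2E/N0)."""
    return observed_d_prime / ideal_d_prime(signal_energy, noise_power_density)


# Hypothetical values: E/N0 = 8 gives an ideal d' of 4.
print(ideal_d_prime(8.0, 1.0))
# An observer reaching d' = 1 at that energy has eta = 0.25 on this reading.
print(efficiency(1.0, 8.0, 1.0))
```

Computing η in this way for several signal types or signal parameters gives the kind of comparison suggested in the text.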

The human observer, of course, performs less well than does the ideal 
observer in the great majority of, if not in all, detection tasks. The interesting 
question, one that usually turns out to be a heuristic question, concerns not 
the amount but the nature of the discrepancy that is observed. The next 
few paragraphs illustrate how asking this question led to information about 
certain substantive issues in audition and vision. 

We find that the human observer performs less well than the ideal 
observer defined for the case of the “signal specified exactly.” That is, the 
human observer’s “psychometric function” (the proportion of correct positive 
responses as a function of the signal energy) is shifted to the right. 
Further, the slope of the human observer’s function is greater than that of 
the ideal function for this particular case, a result sometimes referred to 
as low-signal suppression. Let us consider three possible reasons for these 
discrepancies. 

First, the human observer may well have a noisy decision process, 
whereas the ideal decision process is noiseless. For example, the human 
observer’s criterion may be unstable. If he vacillates between two criteria, 
the resulting point of his OC curve will be on a straight line connecting the 
points corresponding to the two criteria; this average point falls below the 
OC curve (a curve with smoothly decreasing slope) on which the two criteria 
are located. Again, the observer’s decision axis may not be continuous; as 
far as we know, it may be divided into a relatively small number of categories. 
This possibility has not been studied intensively. 
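
The effect of an unstable criterion can be checked with a short calculation. The sketch assumes an equal-variance normal model, which is an assumption made here for illustration; the function names are hypothetical. It averages the two points produced by two criteria and compares the averaged hit rate with the hit rate that the OC curve itself yields at the same false-alarm rate.

```python
from statistics import NormalDist

_N = NormalDist()


def oc_point(d_prime, criterion):
    """(false-alarm rate, hit rate) for a fixed criterion."""
    return 1 - _N.cdf(criterion), 1 - _N.cdf(criterion - d_prime)


def oc_hit_rate(d_prime, false_alarm_rate):
    """Hit rate on the OC curve at a given false-alarm rate."""
    return _N.cdf(d_prime + _N.inv_cdf(false_alarm_rate))


def vacillation_loss(d_prime, c1, c2, weight=0.5):
    """Hit rate lost by mixing criteria c1 and c2 (c1 used with
    probability `weight`), relative to the OC curve."""
    f1, h1 = oc_point(d_prime, c1)
    f2, h2 = oc_point(d_prime, c2)
    f = weight * f1 + (1 - weight) * f2   # point on the connecting line
    h = weight * h1 + (1 - weight) * h2
    return oc_hit_rate(d_prime, f) - h


print(vacillation_loss(1.0, 0.0, 1.0))    # positive: below the OC curve
print(vacillation_loss(1.0, 0.5, 0.5))    # about zero: a stable criterion
```

An observer who vacillates in this way would therefore yield an estimate of d' lower than his sensitivity at either criterion.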

A second likely reason for the deviation from ideal is the noise inherent 
in the human sensory systems. We have attempted to estimate the amount 
of internal noise (including noise in the decision process and in the sensory 
system) in two ways: by examining the decisions of an observer over several 
presentations of the same signal and noise (on tape) [46], and by examining 
the correlation among the responses of several observers to a single presen- 
tation [2, 23]. Both Egan and Green [10, 24] are continuing this line of work 
with taped presentations. 

A third, and favored, possibility is faulty memory. This explanation is 
favored, a priori, because it easily accounts not only for the shift of the 
human’s psychometric function but also for the greater slope of his function. 
The reasoning proceeds as follows: if the detection process involves some 
sort of tuning of the receptive apparatus, and if the observer’s memory of 
the characteristics of the incoming signal is faulty (these characteristics 
being amplitude, frequency, starting time, duration, phase in audition, and 
location in vision), then the observer is essentially confronted with a signal 
not specified exactly, but specified only statistically. He has some uncertainty 
about the incoming signal. When we introduce some uncertainty into our 
calculations of the psychometric function of the ideal detector [35], we 
find that performance falls off as uncertainty increases, and that this decline 
in performance is greater for weak signals than strong ones. That is, a family 
of theoretical uncertainty curves shows progressively steeper slopes coinciding 
with progressive shifts to the right. This is intuitively proper—the accuracy 
of knowledge about signal characteristics is less critical for strong signals, 
since strong signals carry more information about these characteristics with 
themselves. For visual [51] and auditory data [22] the slopes are fitted well 
by the theoretical curve that corresponds to uncertainty among approximately 
100 orthogonal signal alternatives. It is not difficult to imagine that the prod- 
uct of the uncertainties about the time, location, and frequency of the signals 
used in these experiments could be as high as 100. 

It is possible to obtain empirical corroboration of this theoretical analysis 
of uncertainty due to faulty memory. This is achieved by providing various 
aids to memory within the experimental procedure. (Tanner [48] has used 
this technique in auditory studies; Green [22] has replicated his results in 
audition, and Green and Swets [28] have obtained similar results in visual 
studies.) In these experiments, the memory for frequency is made unnecessary 
by introducing a continuous tone or light (a carrier) of the same frequency 
as the signal, so that the signal to be detected is an increment in the carrier. 
This procedure also eliminates the need for phase memory in audition and 
location memory in vision. In further experiments, instead of a continuous 
carrier, a pulsed carrier (one that starts and stops along with the signal) 
is used in order to make memory for starting time and duration unnecessary. 
In all of these experiments a forced-choice procedure is used, so that the 
memory for amplitude beyond a single trial can also be considered irrelevant. 
In this way, all of the information thought to be relevant may be contained 
in the immediate situation. Experimentally, our observers’ psychometric 
functions show progressively flatter slopes as we introduce more and more 
memory aids in this way. In fact, when all of the aids mentioned above are 
used, the observer’s slope parallels that for the ideal observer without un- 
certainty, and it is as little as 3 db from the ideal curve in absolute value. 


Examples of Substantive Problems Studied in this Framework 


I shall simply list some other examples of substantive problems in 
vision and audition that have been studied within the framework of TSD. 
A way of treating one set of related problems follows directly from sta- 
tistical theory. These are problems in which the observation is expanded in 
one way or another: (i) the observation interval is lengthened, (ii) the number 
of observations preceding a decision is increased, (iii) the number of signals 
presented in a given interval is increased, or (iv) the number of observers 
concentrating on the same signal is increased. From statistical theory, the 
distribution of the sum of n random variables with the same mean and 
variance has a mean equal to n times the mean and a variance equal to n 
times the variance. Since the measure of detectability d’ is equal to the mean 
of the hypothetical probability distribution that is due to the signal (we 
can consider the mean of the noise distribution to be zero) divided by the 
standard deviation (the square root of the variance), we would predict that, 
as the number of signals or observations is increased, the measure d’ will 
increase as the square root of the number. Or we might think in terms of 
the result in statistics that the standard deviation of the estimate of a popula- 
tion mean decreases as the square root of the number of samples, and be led 
to the same prediction. Experiments in audition have shown d’ to increase, 
within certain limits, as $\sqrt{t}$, where $t$ is the duration of a tone-burst signal 
in white noise [26], and to increase as $\sqrt{n}$, where $n$ is the number of tonal 
components of a complex signal (up to 16) [27], or the number of observation 
intervals preceding a decision (up to five) [46], or the number of observers 
(up to three) [2]. 
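
The square-root prediction can be checked by simulation. The sketch below assumes independent, unit-variance normal observations, with mean zero for noise and mean d' for signal; the function name, the number of trials, and the seed are hypothetical choices. Independence is the assumption under which the variances add as stated above.

```python
import random
from statistics import fmean, stdev


def simulated_d_prime(n_observations, single_d_prime=1.0,
                      trials=20000, seed=0):
    """Estimate d' for the sum of n independent observations."""
    rng = random.Random(seed)
    noise = [sum(rng.gauss(0.0, 1.0) for _ in range(n_observations))
             for _ in range(trials)]
    signal = [sum(rng.gauss(single_d_prime, 1.0)
                  for _ in range(n_observations))
              for _ in range(trials)]
    # Difference of the means divided by the standard deviation of the sums.
    return (fmean(signal) - fmean(noise)) / stdev(noise)


for n in (1, 4, 9):
    print(n, round(simulated_d_prime(n), 2), "predicted", n ** 0.5)
```

With a single-observation d' of 1, the estimates should lie close to 1, 2, and 3 for one, four, and nine observations.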

A persistent favorite among the substantive problems in audition that 
have been studied experimentally in this setting is the problem of frequency 
analysis, or the problem of the critical band. Fletcher’s [17] original experi- 
ment, in which the width of the band of noise that masked a tonal signal 
was varied systematically, has been repeated with signals of a specified 
duration and with a different type of analysis [54]. A variety of other pro- 
cedures has been used in the attack on this problem, including signals of 
uncertain frequency [7, 20, 46, 54, 56, 57], complex signals compounded of 
several frequencies [21, 27, 31], and signals consisting of a band of noise [19]. 
Other studies of substantive problems that were guided by TSD include 
studies of reaction time [45], physiological recording [16], and some aspects 
of color vision [40]. 


Extended Applications to More Complex Problems 


I have spoken, so far, primarily about simple detection procedures 
(Yes-No and forced-choice procedures involving a single, specified observa- 
tion interval) employing single, simple signals (a tone burst in white noise, 
or a spot of light on a larger uniform background). I should at least mention 
some of the areas in which extensions of the basic theory have been applied 
to somewhat more complex problems. 

The theory has been applied to the recognition or identification process 
as well as to the detection process: to the recognition of one of two frequencies 
[49], and to problems requiring both detection and recognition [43]. Very 
recently, following Anderson’s [1] work in statistics, some preliminary at- 
tempts have been made to deal with the recognition of one of a large number 
of signals and thus to move closer to some significant problems in percep- 
tion [25, 32]. 

Following Wald’s [58] developments in sequential analysis, we have 
studied the problem of deferred decision [44]. In this problem, the observer 
decides after each observation interval whether to make a terminal decision 
(Yes or No) or to request another observation before making a terminal 
decision. Values and costs are assigned to the possible outcomes of a terminal 
decision, and a fixed cost is assessed for each additional observation that 
is requested. This deferred decision paradigm, it will be recognized, represents 
another way of striking out in the direction of realism. We have examined 
in this way the trading relationship that commonly exists in perceptual 
tasks between time and error. In a similar vein, the theory has recently been 
extended to handle the case of detection in which the observation interval 
is not specified for the observer—the so-called vigilance, or low-probability 
watch, problem [13]. 
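
The deferred-decision procedure can be sketched with Wald's sequential probability ratio test. Note the substitution: the bounds below are set from tolerated error probabilities (`alpha`, `beta`), as in Wald's formulation, rather than from the values and costs of the outcomes and the cost of an observation described above. The normal model, the function name, and the parameter values are likewise assumptions made for illustration.

```python
import math


def deferred_decision(observations, d_prime=1.0, alpha=0.05, beta=0.05):
    """Sequential test of signal (mean d_prime) against noise (mean 0),
    both with unit variance. Returns (decision, observations used)."""
    upper = math.log((1 - beta) / alpha)   # reaching this: decide Yes
    lower = math.log(beta / (1 - alpha))   # reaching this: decide No
    log_lr = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        # Log likelihood ratio contributed by one normal observation.
        log_lr += d_prime * x - d_prime ** 2 / 2
        if log_lr >= upper:
            return "yes", n
        if log_lr <= lower:
            return "no", n
    # Between the bounds: another observation would be requested.
    return "undecided", n


print(deferred_decision([2.0, 2.0, 2.0]))
print(deferred_decision([-1.0, -1.0, -1.0]))
print(deferred_decision([0.5, 0.5, 0.5]))
```

Narrowing the bounds shortens the average number of observations at the price of more errors, which is the trading relationship between time and error referred to in the text.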

Still another more complex area of research that has received extensive 
study is that of speech communication. Many of the techniques employed 
in studies with simple signals have been used with success here, including 
operating characteristics, confidence ratings, repetition of items, deferred 
decisions, and analyses in terms of the degree of uncertainty that exists 
[4, 5, 6, 8, 9, 11, 12, 36, 37, 38].* 


*A scheme for organizing the rather extensive bibliography of this paper (and other 
related publications) may be helpful. As it happens, an organization is most effectively 
accomplished in terms of personalities. After the original applications of TSD, Tanner, 
Green, and I have gone somewhat separate ways. Tanner’s efforts have been directed 
principally toward an examination of the role of models in experimental studies and toward 
applications of the concept of the ideal observer; Green has concentrated on substantive 
problems in audition, especially the problem of frequency analysis; I have been mainly 
concerned with the decision process and with applications of the theory to psychophysical 
tasks that are less idealized than those studied initially. Egan, Pollack, and Clarke have 
employed the theory in studying various problems in speech communication and have 
contributed a number of studies dealing with methodological problems. 


PSYCHOMETRICS—A SPECIAL CASE 
OF THE BRAHMAN THEORY 


JACK W. DUNLAP 
DUNLAP AND ASSOCIATES, INC. 


We are gathered today to commemorate the completion of the first 
quarter of a century of the Society’s history, and I consider it no small 
honor to have been selected to address this distinguished group on this 
memorable occasion. When Dr. Charles Wrigley, Chairman of the Program 
Committee, extended the invitation, he expressed the hope that I would 
review the history of the formation of the Society and discuss something of 
its achievements to date. Today I shall only highlight the early history of 
the organization, since in my Presidential Address in September 1941 a 
rather full history was presented. 

The policy and philosophy laid down initially have, in my opinion, 
proved sound over the past quarter century. Careful re-examination con- 
vinces me that these will be sound not only for the next quarter century, 
but probably for the next century as well. It is my firm conviction that the 
Society and the Corporation through the Journal, Psychometrika, have made 
a substantial contribution to psychology. It is proper, however, and perhaps 
timely to re-examine our past contributions to psychology and society and 
perhaps to question whether we have become stagnant or made fully the 
contribution envisaged by the founders. Before reviewing our contribution, 
it might be wise to review briefly the early history of these organizations. 


1931 


Dr. A. P. Horst was attempting to buy or form a journal to be devoted 
to quantitative methods as applied to education and psychology. 


1932 


Horst interested his associate, Dr. A. K. Kurtz, in the need for such 
a journal and they began preliminary correspondence with other psycho- 
logists interested in quantitative applications. 


1933 


Horst and Kurtz discussed the problem at great length with Drs. L. L. 
Thurstone and Marion W. Richardson. The idea appealed strongly to Thurs- 
tone and won his enthusiastic support without which it is doubtful success 
would have crowned Horst’s efforts. Richardson’s interest in theoretical 
problems of test construction assured his support. The speaker was approached 
because of his knowledge and the volume of technical material available 
to him as editor of the Journal of Educational Psychology, the only outlet 
of consequence at that time for such articles. 


1934 


Thurstone became quite active in advising on the details of establishing 
such a journal and made a number of attempts to secure support from founda- 
tions, but without success. Horst, Kurtz, Richardson, and Stalnaker developed 
details as to costs, prospective publishers, policies and methodologies. At 
the fall meeting of the American Psychological Association, these men met 
with the speaker in a series of conferences, and became a firmly knit group 
determined to establish the new Journal. Kurtz emphasized that the readers 
of the Journal would be interested in forming a society, since this would 
identify individuals with common interests, focus attention on the need 
and importance of developing a quantitative rationale for psychology, pro- 
vide a mechanism for physical meetings and, perhaps most important, 
provide financial support for the Journal. 


1935 


Thurstone made it possible through contributions of his own and his 
staff’s time and facilities to canvass biometricians, educators, psychologists, 
and statisticians as to their interest in the proposed Society and Journal. 
Invitations were extended to all who replied, to attend the formation of 
the Society on September 4, 1935, during the annual meeting of the American 
Psychological Association. The new Psychometric Society was formally affili- 
ated with the APA at the annual meeting later that week. 

Temporary officers had been appointed at the formation meeting, and 
subsequently ballots were mailed out for election of officers. Thurstone 
was elected President; Horst, Secretary; and the speaker, Treasurer. 


1936 


A committee composed of Horst, Kurtz, and Richardson prepared the 
constitution for the Society, and it was adopted at the Dartmouth meeting 
in September. There was still no capital for starting the Journal. The speaker 
had estimated publication costs as approximately a thousand dollars a year, 
which subsequently proved approximately correct. Suddenly, shortly after 
the beginning of 1936, Horst, impatient with the delay and with confidence 
in the future, offered to underwrite the losses of the Journal for the first year 
up to one fourth of its cost. This example was immediately followed by 
Kurtz, Richardson, and E. L. Thorndike and, so as not to appear too 
niggardly, Dunlap. Somehow the word got about as to the plans for initial 
financing of the Journal and it was only a very short time until pledges of 
support had been received from Guilford, Gulliksen, Kuder, Lorge, and 
Stalnaker. I might add these loans were repaid by September 1941. 

With sufficient funds to underwrite the publication of the Journal for 
a year and a half, the project moved ahead rapidly. In July 1936, Thurstone 
wrote the speaker as follows: 

“We have been discussing the relation of Psychometrika to the Psycho- 
metric Society. It seems best to organize the Psychometric Corporation 
whose principal function would be to publish the Journal. Members of the 
corporation would be those who had made some contribution for its support. 

“The principal reason why we suggest the Psychometric Corporation 
as the owner and publisher of the Journal is that the objectives of the Journal 
should be maintained. There is, of course, some possibility that the popular 
control of Psychometrika might change its character in a few years to a 
popular mental test Journal. This is, of course, not our principal objective 
and it seems best therefore to place the ownership and control of the editorial 
policies in the hands of the initial group with such additions as they may 
elect. Eventually we might have to make some compromises in the direction 
of popular mental test material, but such questions of policies should be in 
the hands of those who initiated and sponsored the Journal financially rather 
than in the hands of popular vote. This is what we have in the back of our 
minds in arranging for a partial separation in controlling the Journal.” 

On July 25, Dr. Thurstone sent the following instructions to the lawyers. 


1. The members of the Psychometric Corporation shall be members 
   of the Psychometric Society. 

2. The initial members of the Psychometric Corporation are: 
      Jack W. Dunlap          Albert K. Kurtz 
      J. P. Guilford          Marion W. Richardson 
      Harold O. Gulliksen     John M. Stalnaker 
      Paul Horst              L. L. Thurstone 

3. New members are to be elected by three-fourths majority of 
   the membership. 


The Psychometric Corporation was incorporated in the State of Illinois on 
August 24, 1936. At the September meeting of the Corporation, Kuder, 
Rulon, Lorge, and Thorndike were elected to membership. 

In concluding the early history of the Society and the Corporation, 
I must touch on a sad note. Three of our first Presidents—Dr. L. L. 
Thurstone, Dr. Edward L. Thorndike, and Dr. Karl Holzinger—are no 
longer with us. Individually and as a Society, we owe a great deal to these 
distinguished leaders in our fields for the encouragement, stimulation and 
direction they gave. The fifth President, Dr. J. P. Guilford, is with us 
today, and I am pleased to announce the sixth President is still around. 

While the Journal proved a financial success from the beginning, all 
was not well with the Society. The Society was barely able to get sufficient 
papers in 1940 to hold a program at the annual meeting, and in 1941 it had 
to forego a program for lack of papers. In the Presidential address that 
year, I raised a series of sharp questions to the membership and indicated 
a number of areas in which work was badly needed. Despite the fact that 
we were shortly engaged in World War II, the membership responded and 
the Editors were able to publish the Journal throughout the war. 

It would be well now to review again the objectives laid down by the 
founders of the Society and to examine how well they built. On the inside 
cover of the first issue of the Journal the following statement appeared. 


The Psychometric Corporation was organized for the purpose of sponsoring and publish- 
ing a professional journal on the following subjects. 


1. The development of quantitative rationale for the solution of psychological problems. 

2. New mathematical and statistical techniques for the evaluation of psychological data. 

3. Aids in the application of statistical techniques, such as nomographs, tables, work- 
sheet layouts, forms, and apparatus. 

4. Critiques or reviews of significant studies involving the use of quantitative techniques. 

5. General theoretical articles on quantitative methodology in the social and biological 
sciences. 


In the June 1960 issue, these same five objectives were re-affirmed on the 
masthead as the continued philosophy of the Psychometric Corporation. 

In an effort to determine how successfully the policies have been main- 
tained over a quarter of a century, I reviewed the four issues of Volume 1 
and the last two issues of Volume 24 and the first two issues of Volume 25 
of Psychometrika. I then classified the articles as to the five basic objectives 
set forth above, with the following results. No claim is made as to the accuracy 
of classification, but at least the material in all issues was classified against 
common criteria by a single rater. 


                                               Vol. 1    Vol. 24, Vol. 25 

Development of quantitative rationale                           11 
New mathematical and statistical techniques                     13 
Aids in application of statistical techniques                    5 
Critiques of significant studies                                 2 
General theoretical articles                                     0 

Total                                            30             31 

Casual examination indicates there has been little, if any, shift in the 
distribution of articles appearing in the Journal over a quarter of a century. 
Clearly, we have achieved one of the goals set forth in Dr. Thurstone’s 
letter of July 1936. There is, however, one disturbing element in this table, 
namely, the paucity of basic theoretical articles. 

For several years now I have had the impression that the Journal was 
becoming stagnant, limited in authorship, and extremely narrow in its con- 
cept of what constitutes the development of quantitative rationale for the 
solution of psychological problems. Perhaps this impression stems from the 
rigidity that comes with age. In order to test this impression, I developed 
a list of thirteen topics which, after discussion with others, seem appropriate 
subjects for publication in the Journal and areas of possible research for the 
membership of the Society. Twenty issues, published from 1955 through 
1959, were examined and the articles classified according to topic. The 
results are set forth below. 


Nature of articles published in Psychometrika, 1955 through 1959 


Statistics of tests and measurements       65 
Factor analysis                            34 
Statistics (general)                       24 
Mathematical models of human behavior      16 
Test theory 
Learning theory 
Game theory 
Information theory 
Detection theory 
Communication theory 
Decision theory 
System theory 
Servo theory 

Total 


The average number of articles is 29.6, almost identical with the number 
in Volume 1. The cover of the Journal is unchanged, the format unchanged, 
and the nature of the contents interchangeable with those of Volume 1. 
I might add that eight of the original editors are still on the editorial board. 
Clearly, the fear expressed in Dr. Thurstone’s letter of July 1936 that the 
Journal might become a popular mental test journal has not materialized. 
In view of Drs. Thurstone’s and Horst’s sensitivity to new frontiers, it is 
interesting to speculate as to their reaction to the fact that the Journal 
still devotes two-thirds of its space to articles on the statistics of tests and 
measurements and factor analysis. 

True, the Journal and the Society have escaped the “fate worse than 
death” of becoming a popular mental test journal, but perhaps we have 
fallen victim to an equally deplorable fate of being a “popular test journal.” 
Surely it is apparent by now that the Psychometric Society and the Journal 
are classic examples of Brahman theory. 

When Paul Horst and Marion Richardson came riding out of the West 
to the meeting in the fall of 1934 with the dream of establishing a journal 
for the development of quantitative rationale for the solution of psychological 
problems and for general theoretical articles on quantitative methodology 
in the social and biological sciences, it is doubtful they envisaged factor 
analysis and the statistics of tests as the “sine qua non” of psychology, 
nor do I believe this was the belief of the other six founders. Today, the 
form and the ritual remain, but the spirit of developing new frontiers is no 
longer apparent in the pages of the Journal. We have become complacent, 
perhaps authoritarian, and unimaginative in our approach to new oppor- 
tunities in developing a quantitative science. We need new blood, new ideas, 
and the audacity of youth to lead us in the further development of our 
science. We cannot view lightly that some of the most important develop- 
ments in recent years in the field of quantitative psychology have been made 
by persons who have been trained primarily in physics and mathematics, 
as exemplified by the contributions of Robert Bush and Fred Mosteller. 

In the July 14, 1960, issue of The Listener, an English publication, 
George Steiner wrote a shocking indictment of psychometrics. 


Vehement Obscurity 

Finally there are those pursuits that call themselves, significantly, the social sciences. 
As practised by their exponents, particularly in Germany and America, they are largely 
illiterate or anti-literate. Their papers and books are written in a jargon of vehement 
obscurity. Wherever they can, they replace the verbal concept with the mathematical or 
statistical expression, the curve, the graph. Where they cannot, they inject into language 
pseudo-words borrowed from the exact sciences (“norms,” “group,” “scatter,” “functions,” 
“integrations”). All these are words with a specific mathematical or notational content. 
Emptied of it they become the pretentious, deceptive jargon of the American sociologist, 
and in using such jargon he pays eloquent tribute to the fact that all exact knowledge 
must seek to assume the respectability of the natural and mathematical sciences. 


Perhaps the social sciences deserve this blast, but I am convinced Steiner 
overstated his case. Nevertheless, his remarks should stimulate us to review 
our work. 

There are a number of areas notably absent from Psychometrika which, 
in practice, appear to fall squarely within its stated realm of developing 
quantitative rationale and general theoretical articles on quantitative metho- 
dology. Consider, for example, information theory which permits us to des- 
cribe quantitatively the rate at which man processes information. Only one 
article has appeared in Psychometrika on this very important area of human 
activity. True, there are limitations to the technique, but there are some 
strong advantages in using this technique to describe one aspect of human 
behavior. The problem of describing the behavior of man in continuous 
control tasks, where he is constantly receiving inputs and his outputs are 
attempts to correct the inputs, is dealt with in servo-theory but no articles 
deal with this important area. 

The problems of risk-taking by individuals and groups are real life 
problems, but only one article has appeared in the Journal on game theory. 
Nevertheless, the applications of human game playing in both military and 
management games are well known and certainly deserve consideration under 
the statement of principles to which we have rendered lip service for a quarter 
of a century. 

In the last five years, only eight articles have appeared in the Journal 
which could be classified as mathematical models of human behavior, and 
only one each in the important areas of human behavior of game theory, 
information theory, and detection theory. 

One of the most important aspects of human behavior is communications; 
yet no attention has been paid to problems of network theory, cybernetics, 
the signal-to-noise problems, or of how man functions in a communication 
channel. Closely related are the very real problems of detection theory as 
exemplified in discrimination, vigilance, and signal detection in the presence 
of noise, but detection theory has been ignored by psychometricians. 

Man is constantly confronted with the necessity of making decisions. 
Even now many of you are contemplating the choice of “Shall we hear him 
out or throw him out?” Quantitative methods and stochastic models have 
been developed by others regarding the theories of decision-making; yet we 
appear to ignore sedately this important aspect of human behavior. 

It is only in the field of mathematical and mechanical models of human 
behavior that we have been willing to stray from the narrow and rigid pattern 
set by the first few years of publication. Nowhere in the last five years did I 
find an attempt to apply queueing theory or linear programing and only one 
instance of the application of Monte Carlo theory. True, none of the above 
areas may be worthwhile for developing a quantitative rationale for psychol- 
ogy, but they surely deserve exploration. The least we could have done was 
to have prepared critiques of significant studies using quantitative techniques. 

The basic objectives of the Society and the Journal are, I repeat, as 
sound today as they were when they were formulated more than a quarter of 
a century ago. I do not contend for a moment that we should do away with 
articles dealing with statistics of tests or of factor analysis, but rather we 
should consider modifying the “mix” of articles to encompass a broader 
concept of the development of a quantitative rationale for our field. We 
would be well advised, in my opinion, to forego the Brahman theory* and 
begin to search actively for articles on these newer techniques and approaches, 
and to contemplate seriously bringing into the Society, the Corporation, and 
onto the editorial board of the Journal, able and brilliant younger scientists. 
In concluding this address to this distinguished group I propose the question, 
“Whither now, O sacred cow?” 

*The theory, held by many, that the cow is sacred. 


NEW DEVELOPMENTS IN STATISTICAL BEHAVIOR THEORY: 
DIFFERENTIAL TESTS OF AXIOMS 
FOR ASSOCIATIVE LEARNING 


W. K. Estes 
INDIANA UNIVERSITY 


Several types of new developments in statistical behavior theory deserve 
attention: (i) attempts at relatively direct empirical tests of basic axioms; 
(ii) the application of models that survive these tests for purposes of data 
reduction and analysis; (iii) the extension of extant models, or elaborations 
of them, to new empirical problems; and (iv) the introduction of new models 
based primarily on abstract mathematical considerations. The last two of 
these “new developments” have been well represented in recent books [3, 11]; 
the first two will be illustrated in the following sections. 


Linear Models 


In all of the earlier work on statistical behavior theory, evidence con- 
cerning the tenability of the various fundamental assumptions has been 
quite indirect. The usual procedure in treating any particular experimental 
situation has been to formulate a model for that situation by appropriate 
interpretation of the general concepts and assumptions, then to generate 
and test predictions flowing from the model. Confirmatory results could 
be taken to support the general approach and the assumptions in combination. 
What has become only gradually apparent in the course of continued analysis, 
however, is the extent to which some of the assumptions of a model can be 
modified while leaving many of the quantitative predictions unchanged. As 
an example of this state of affairs, one may find in a single recent volume 
concerned with mathematical learning theory [3] some half dozen distinct 
models all of which yield the same formula for the predicted mean learning 
curve in a two-choice, probability learning situation. 

One important respect in which these models differ is the type of assump- 
tion made concerning the nature of the learned association between stimulus 
and response. In my own earliest formulations [see 5], the stimulus in, say, 
a paired associate experiment was represented as a finite, but presumably 
large, set of stimulus elements, a portion of which were randomly sampled 
by the subject on each trial. The basic learning assumption was that all 
elements in the trial sample became conditioned to the reinforced response 
(any one element being conditioned to only one response at any one time). 
Response probability, assumed in the theory to be equal to the proportion 
of stimulus elements conditioned to the given response, would then vary 
over an essentially continuous set of values (actually continuous in Bush 
and Mosteller’s formulation [4]). 

One consequence of this statistical model is that for a group of subjects 
the mean curve of response probability vs. trials should be exponential in 
form with the slope parameter independent of the number of response alter- 
natives. Predictions of this sort have been nicely confirmed [2, 3, 4, 5]. A 
second consequence of this model is that the probability of a correct response 
to a stimulus on the part of any individual subject should increase in a gradual 
fashion as a function of successive reinforcements. Relatively direct tests of 
this latter prediction have been attempted only recently, and the results 
have come as something of a jar to expectations built up during a period of 
preoccupation with group curves. 

The simplest experimental design for this kind of test comprises a single 
reinforced trial followed by two consecutive, unreinforced test trials, which 
may conveniently be symbolized 

$$R \; T_1 \; T_2 .$$

Procedures and results for a number of experiments using this design have been 
reported elsewhere [6, 8]. The main point of interest here will be to examine 
their implications for different types of statistical learning models. 

For simplicity, suppose one chooses a situation in which the probability 
of a correct response is zero before the first reinforcement (and in which the 
likelihood of a subject’s obtaining correct responses by guessing is negligible 
on all trials). We can readily generate predictions for the probabilities, $p_{ij}$, 
of various combinations of response $i$ on $T_1$ and response $j$ on $T_2$ from the 
stimulus sampling model sketched above or the (equivalent for present 
purposes) linear model ([3], ch. 8). These predictions are summarized in 
Table 1 together with data from an unpublished experiment conducted in 
the writer’s laboratory with the assistance of Miss Judith Crooks. In the 


TABLE 1 


Observed Response Proportions for RTT Experiment 
together with Predictions from Variants of the Linear Model 

table, $p_{11}$ denotes probability of correct responses on both $T_1$ and $T_2$, $p_{10}$ 
probability of a correct response on $T_1$ followed by an incorrect response 
on $T_2$, and so on. The entries under Simple Linear Model (SLM) are 
predictions from the linear model with $\theta$ as the learning parameter. According to 
this model ([3], ch. 8), probability of a reinforced response changes in 
accordance with the linear difference equation 

$$p_{n+1} = (1 - \theta)\,p_n + \theta . \tag{1}$$

In the present application, $p_0 = 0$ by hypothesis, so $p_1$, the probability of a 
correct response on $T_1$, is simply equal to $\theta$. No reinforcement is given on $T_1$ 
and consequently the probability of a correct response does not change 
between $T_1$ and $T_2$; therefore the probability of correct responses (by a given 
subject to a given stimulus) on both $T_1$ and $T_2$ is $\theta^2$. Other entries under 
SLM are obtained similarly. 
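
The SLM entries described above can be sketched in a few lines. This is only an illustration under the stated assumptions ($p_0 = 0$, no change in response probability between $T_1$ and $T_2$); the function name `slm_rtt` and the trial value of $\theta$ are hypothetical choices, the latter borrowed from the observed proportion correct on $T_1$ quoted later in the text.

```python
def slm_rtt(theta):
    """Joint probabilities p_ij for the R T1 T2 design under the simple linear model.

    Assumes p_0 = 0 and no change in response probability between T1 and T2.
    """
    p1 = (1 - theta) * 0.0 + theta  # equation (1) applied once, starting from p_0 = 0
    return {
        "p11": p1 * p1,              # correct on T1 and on T2
        "p10": p1 * (1 - p1),        # correct on T1, incorrect on T2
        "p01": (1 - p1) * p1,        # incorrect on T1, correct on T2
        "p00": (1 - p1) * (1 - p1),  # incorrect on both tests
    }


if __name__ == "__main__":
    # theta = .385 is an illustrative value, not a fitted estimate.
    for key, value in slm_rtt(0.385).items():
        print(key, round(value, 3))
```

Whatever value of $\theta$ is supplied, the two shift probabilities returned by this sketch coincide, which is the property the following paragraphs turn on.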

The observed proportions in Table 1 represent data from 40 subjects 
each tested on 15 paired associate items, the stimulus members of the items 
being randomly drawn consonant trigrams and the responses English words. 
The $R\,T_1\,T_2$ design was applied to each item. In order to minimize the 
probability of correct responses occurring by guessing on unlearned items, these 
items were introduced, one per trial, into a larger list, the composition of 
which changed from trial to trial in an unpredictable (from the subjects’ 
viewpoint) manner. A critical item introduced on trial $n$ received one 
reinforcement (paired presentation of stimulus and response members) followed 
by a test (presentation of stimulus member alone) on trial $n$ and then a 
test only on trial $n + 1$, following which it was dropped from the list. 

No extensive calculations are needed to show that the simple linear 
model cannot handle the data. It suffices to note that the model requires 
$p_{10} = p_{01}$, whereas the difference between these two entries in the data 
column is large and highly significant. The basic difficulty with the model 
is that if the reinforced trial has produced some degree of learning, in general 
incomplete, for a given subject and item, then the probability of a correct 
response on $T_2$ is predicted to be greater than zero (and substantially so, given 
any combination of parameter values that could yield correct response 
probabilities of the order of those observed on $T_1$) even in cases where the response 
failed to occur on $T_1$. But in the data, the probability of this sequence of 
events is virtually zero, suggesting that when the correct response fails to 
occur on $T_1$, it is simply because no learning has occurred on the reinforced 
trial. 

One might try to “save” the linear model by arguing that the pattern 
of observed results in Table 1 could have arisen as an artifact. If there are 
differences in difficulty among items (or, equivalently, differences in learning 
rate among subjects), then the instances of incorrect response on $T_1$ would 
predominantly represent smaller $\theta$ values than instances of correct responses. 
On this account one might expect that the predicted proportion of correct 
following incorrect responses would be smaller than that allowed for under 
the “equal $\theta$” assumption, and therefore that the linear model might not 
actually be incompatible with the data of Table 1. The validity of such an 
argument can be readily ascertained. Suppose that a relatively large value, 
$\theta_1$, of the learning parameter were associated with a proportion $f_1$ of the 
items (or subjects) and a smaller value $\theta_2$ with the rest. Then in each case 
where $\theta_1$ is applicable, the probability of a one on $T_1$ followed by a zero on 
$T_2$ is $\theta_1(1 - \theta_1)$; in each case where $\theta_2$ is applicable, the same probability is 
$\theta_2(1 - \theta_2)$. Clearly we would have for such a group of mixed fast and slow 
learners 

$$p_{10} = f_1\theta_1(1 - \theta_1) + f_2\theta_2(1 - \theta_2).$$

But a similar argument yields also 

$$p_{01} = f_1(1 - \theta_1)\theta_1 + f_2(1 - \theta_2)\theta_2 .$$

The derivation extends in an obvious manner to the general case of a 
population consisting of $N$ individuals who can be categorized into any number of 
subsets. The $f_i$ individuals in the $i$th subset are characterized by a learning 
parameter $\theta_i$. Theoretical formulas for the general case are given in the 
third column of Table 1. Since, again, the expressions for $p_{10}$ and $p_{01}$ are 
equal for all choices of parameter values, it is plain that individual differences 
in learning rates could not be responsible for the observed pattern of results. 
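
The symmetry argument for mixed learning rates can be checked numerically. The sketch below is illustrative only: the function name `mixture_shift_probs` and the particular proportions and learning rates are assumptions chosen to make the contrast between fast and slow learners extreme.

```python
def mixture_shift_probs(groups):
    """Return (p10, p01) for a mixture of subgroups under the linear model.

    `groups` is a list of (proportion, theta) pairs whose proportions sum to one.
    Each subgroup contributes theta * (1 - theta) to both shift probabilities.
    """
    p10 = sum(f * theta * (1 - theta) for f, theta in groups)
    p01 = sum(f * (1 - theta) * theta for f, theta in groups)
    return p10, p01


if __name__ == "__main__":
    # Hypothetical mixture: 30 percent fast learners, 70 percent slow learners.
    print(mixture_shift_probs([(0.3, 0.9), (0.7, 0.1)]))
```

However the proportions and rates are varied, the two returned values agree, so heterogeneity in $\theta$ alone cannot produce unequal shift probabilities.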

A related hypothesis which might seem to merit consideration is that 
of individual differences in rates of forgetting. Since the proportion of correct 
responses on $T_2$ is less than that on $T_1$, there is evidently some retention 
loss, and differences among subjects, or items, in susceptibility to this retention 
loss might be a source of bias in the data. The hypothesis can be embodied 
in the linear model as follows. Probability of the correct response on $T_1$ is 
equal to $\theta$, and in the preceding derivations we have assumed the same 
probability to obtain on $T_2$. If there is a retention loss, however, probability 
of the correct response on $T_2$ would have declined to some new value $r$, such 
that $r$ is less than $\theta$. If there are individual differences in amount of retention 
loss, then we should again conceive the population of subjects and items to 
be categorizable into subgroups, with the $f_i$ individuals in the $i$th subgroup 
characterized by retention parameter $r_i$. Theoretical expressions for the $p_{ij}$ 
can be derived for such a population by the same method used in the preceding 
case; the resulting formulas are given in the fourth column of Table 1. This 
time, the expressions for $p_{10}$ and $p_{01}$ are different. With a suitable choice of 
parameter values, they could accommodate the difference in the corresponding 
observed proportions. Another difficulty remains, however. To obtain a near 
zero value of $p_{01}$, we would require either a $\theta$ value near unity, which would 
be incompatible with the observed proportion of .385 correct on $T_1$, or a 
value of $\sum_i f_i r_i / N$ near zero, which would be incompatible with the observed 
proportion of .255 correct on $T_2$. Thus we have no support for the hypothesis 
that individual differences in amount of retention loss might be responsible 
for the observed pattern of empirical values. 

A suggestion of quite different type, brought forward by individuals 
attempting to find a reasonable interpretation of the RTT data in terms of 
a linear model, is that learning occurs on test trials as well as on reinforced 
trials. Since in our experiments the subject receives no informative feedback 
of any sort from the experimenter on $T_1$ or $T_2$, it is scarcely conceivable that 
selective learning of the correct response could occur. However, on the basis 
of association-by-contiguity [5, 9], or some hypothesis of “self-reinforcement,” 
one might expect learning in the sense of an increase in strength of 
whatever response, correct or incorrect, actually occurs on a test trial. In terms 
of the linear model, this would mean that if a correct response occurs on $T_1$ 
its probability on $T_2$ should be given by 

$$p_2 = (1 - \theta')p_1 + \theta' = (1 - \theta')\theta + \theta'; \tag{2}$$

whereas if an incorrect response occurs on $T_1$, probability of the correct 
response on $T_2$ should be given by 

$$p_2 = (1 - \theta')p_1 = (1 - \theta')\theta. \tag{3}$$

Equation (2) is, of course, just a specialization of (1) with the learning 
parameter designated as $\theta'$ rather than $\theta$, to allow for the possibility that learning 
rates might be different on reinforced trials and test trials, and with $p_1$ set 
equal to $\theta$ as before. To obtain (3), note that if an incorrect response occurs 
on $T_1$, its new probability on $T_2$ should be given by (2) with $p_2$ and $p_1$ 
replaced by the corresponding probabilities of incorrect responses, $(1 - p_2)$ 
and $(1 - p_1)$, respectively, viz., 

$$(1 - p_2) = (1 - \theta')(1 - p_1) + \theta',$$

whence 

$$p_2 = 1 - (1 - \theta')(1 - p_1) - \theta' = (1 - \theta')p_1 = (1 - \theta')\theta.$$


Now, multiplying the probability of a one or a zero on $T_1$ by the conditional 
probability of a one or a zero on $T_2$, using expressions (2) and (3), 
we have the expressions for $p_{ij}$ listed in the last column of Table 1. As in 
several of the other variants considered, the theoretical expressions for $p_{10}$ 
and $p_{01}$ turn out to be identical. Evidently we must conclude that even if 
learning does occur to some degree on test trials, the effects upon response 
shifts from one to zero and zero to one must be symmetrical and therefore 
cannot bring the linear model into line with the data. 
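
The symmetry can be verified by carrying out the multiplication just described, taking $\theta$ as the probability of a correct response on $T_1$ and using (2) and (3) for the conditional probabilities on $T_2$. The following is only a worked restatement under the assumptions already given, not an additional result:

```latex
\begin{align*}
p_{10} &= \theta\,\bigl[1 - (1-\theta')\theta - \theta'\bigr]
        = \theta\,(1-\theta)(1-\theta'),\\
p_{01} &= (1-\theta)\,(1-\theta')\,\theta .
\end{align*}
```

The two products contain the same three factors, so they agree for every choice of $\theta$ and $\theta'$.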

One could go on in similar fashion and examine the results of supple- 
menting the original linear model by hypotheses involving more complex 
combinations or interactions of possible sources of bias. Personally, I do 
not find such a course appealing. If no plausible complicating factor that 
has come to mind accounts for the disparities between theory and data when 
it enters the picture in a simple and direct way, then I suspect that the more 
fruitful course of action is to reject the model (for the given situation) and 
consider revision of the basic assumptions. 

It is worth noting, further, that our findings relative to the linear model 
must hold, not simply for models which represent the effects of reinforcement 
by linear transformations of response probabilities ([4]; [3], ch. 8), but for 
the whole family of “incremental” models, including those of Hull [10], 
Restle [12], and, most recently, Luce [11]. The critical assumption all of 
these theories have in common is that all subjects characterized by a given 
set of parameter values receive the same increment in probability of the 
reinforced response on a given training trial. Consequently, in seeking a 
more satisfactory interpretation of the RTT experiment, we may find greater 
potentialities in the family of stimulus sampling models, which differ from 
the linear models with respect to precisely this assumption. In sampling 
models, the branching process begins earlier, so that the effect of a single 
reinforced trial is to produce, not a given increment in correct response 
probability for all like subjects, but rather an increment for a (randomly 
selected) subset of individuals and no increment at all for the rest; thus the 
appellation “all-or-none” acquisition. 


Stimulus Sampling Models 


An important term in recent formulations of stimulus sampling theory 
([3], ch. 1; [7]) is the conditioning parameter, herein designated $c$, which may 
be interpreted as the probability that the previously unconditioned portion 
of the trial sample of stimulus elements becomes conditioned to the 
experimenter-reinforced response. Theoretical formulas for the $p_{ij}$ of the RTT 
experiment, derived from a stimulus sampling model with $N$ as the number of 
elements in the stimulus population and $s$ as the number sampled per trial, 
are given in the first column of Table 2. To obtain the first entry, we note 
that the correct response can occur on either test trial only if conditioning 
occurs on the reinforced trial, which has probability $c$. On occasions when 
conditioning occurs, the whole sample of $s$ elements becomes conditioned to 
the correct response and the probability of this response on each of the test 
trials is $s/N$. On occasions when conditioning does not occur on the reinforced 
trial, probability of the correct response remains at zero over both test 
trials. As in the case of the linear model, the stimulus sampling model requires 
$p_{10} = p_{01}$, so no choice of parameter values will permit the model to fit the 
data of Table 1. 


TABLE 2 
Interpretation of RTT Experiment in Terms of Stimulus Sampling Models 

            Fixed s, N                   Pattern Model        Pattern Model 
            Model                        Perfect Retention    Imperfect Retention 

$p_{11}$    $c\,s^2/N^2$                 $c$                  $cr$ 
$p_{10}$    $c\,(s/N)(1 - s/N)$          $0$                  $c(1 - r)$ 
$p_{01}$    $c\,(1 - s/N)\,s/N$          $0$                  $0$ 
$p_{00}$    $1 - c + c\,(1 - s/N)^2$     $1 - c$              $1 - c$ 


But despite the fact that the unequal observed shift probabilities remain 
unaccounted for, we may have made some progress. A second difficulty 
with the linear model family was that there was no reasonable way to permit 
a near zero value of $p_{01}$. In the sampling model the situation is different in 
this respect, for when no correct response occurs on $T_1$ it is because no 
conditioning occurred on the reinforced trial, and therefore no shifts from zero 
to one are predicted. If $s = N = 1$ in the sampling model, we arrive at the 
special case which has been termed the simple pattern model ([3], pp. 12-38), 
since the one remaining stimulus element must be interpreted, in psychological 
terms, as representing the full pattern of stimulation which recurs at the 
beginning of each trial. As can be seen in the second column of Table 2, there 
is just one important respect in which the simple pattern model falls short 
of being able to fit the pattern of observed $p_{ij}$'s, namely in the requirement 
that $p_{10} = 0$, an outcome that cannot be tolerated by our data (or by the 
data of other RTT experiments [6, 8]). The difficulty with the simple pattern 
model is that it demands perfect retention in the sense that a correct response 
on any test trial must be followed by correct responses to the given item on 
all subsequent tests. 

Evidently the model must be modified to admit the possibility of some 
retention loss between $T_1$ and $T_2$, perhaps because of fluctuations in 
contextual cues which make the probability less than unity that the stimulus 
pattern effective upon presentation of a given item on $T_1$ will be perfectly 
reinstated on $T_2$. Letting $r$ represent the probability that a correct response 
given to any item on $T_1$ will be recalled on $T_2$, the “pattern model with 
imperfect retention” yields the theoretical formulas shown in the third column 
of Table 2. With suitable choices of values for $c$ and $r$, these formulas can 
generate values very close to the observed proportions of Table 1 (for $p_{11}$, 
$p_{10}$, $p_{01}$, and $p_{00}$, respectively, the theoretical values are .241, .151, .000, 
and .608). 
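
As a rough check on the quoted theoretical values, the two parameters can be recovered from them, since under this model $p_{11} + p_{10} = c$ and $p_{11} = cr$. The sketch below assumes only the third-column formulas of Table 2; the function names are hypothetical, and the recovered values of $c$ and $r$ are implied by the quoted figures rather than reported in the text.

```python
def pattern_imperfect_retention(c, r):
    """Predicted p_ij for the pattern model with imperfect retention (Table 2, third column)."""
    return {"p11": c * r, "p10": c * (1 - r), "p01": 0.0, "p00": 1 - c}


def solve_c_r(p11, p10):
    """Recover c and r from the two nonzero 'correct on T1' cells."""
    c = p11 + p10  # a correct response on T1 requires conditioning on the reinforced trial
    r = p11 / c    # proportion of those correct responses retained on T2
    return c, r


if __name__ == "__main__":
    c, r = solve_c_r(0.241, 0.151)
    print(round(c, 3), round(r, 3))
    print(pattern_imperfect_retention(c, r))
```

Under these formulas the predicted proportion correct on $T_1$ is $c$ and on $T_2$ is $cr$, which may be compared with the observed proportions of .385 and .255 cited earlier.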







This outcome of our series of attempts to handle the RTT data should 
be viewed with some conservatism, but perhaps also with some sanguinity 
for the future. No rigorous test of the model just considered can be obtained 
with two parameters to be estimated and only three degrees of freedom in 
the data. Further, it seems likely on psychological grounds that an adequate 
account of retention losses over more than a single pair of test trials will 
require somewhat more elaborate conceptual apparatus. However, unlike 
the other types of assumptions we have investigated, the combination of 
all-or-none acquisition and imperfect retention appears to offer promise, first, 
for the immediate problem of interpreting the RTT experiment, and, second, 
as the basis for a more satisfactory general theory of associative learning. 

Since our ultimate objective is a theory of some generality, it is of 
course essential to arrange tests of each of the basic assumptions with a 
number of different experimental designs. To illustrate just one alternative 
approach to testing of the all-or-none assumption, we might sketch briefly 
the plan of a study now being conducted by E. J. Crothers in the Indiana 
laboratory. In schematic outline, Crothers’ design takes the following form: 


            $R_1$    $T_1$    $R_2$    $T_2$ 

Type 1      A        B        C        P(C) = ? 
Type 2      A        B        A        P(A) = ? 


At the conclusion of the $T_1$ trial of a paired associate experiment, items 
are assigned to Type 1 and Type 2 in such a way that all irrelevant factors 
are equated. All items selected for these treatments are ones for which some 
response A was reinforced on $R_1$ but the subject gave a different (“incorrect”) 
response on $T_1$. On the second training trial, some new response C is 
reinforced in the presence of the stimulus member of an item of Type 1, but 
response A is reinforced a second time in the presence of the stimulus member 
of an item of Type 2. It would seem clear that a linear model, or in fact any 
sort of “incremental” model, must predict that P(C), the probability of the 
once reinforced response C, would be less than P(A), the probability of the 
twice reinforced response A, on $T_2$, whereas according to an all-or-none 
theory (e.g., the simple pattern model discussed above) these probabilities 
should be equal. It will be interesting to see whether the data generated by 
Crothers’ quite different approach will agree with those of the RTT experi- 
ments in their implications for acquisition theory. 
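
The contrast between the two classes of prediction can be sketched as follows. This is a simplified illustration, not Crothers' analysis: it assumes a single learning parameter for all items, no learning on test trials, negligible guessing, and (for the all-or-none case) perfect retention, so that an incorrect response on $T_1$ implies no conditioning on $R_1$. The function names are hypothetical.

```python
def linear_predictions(theta):
    """P(C) and P(A) on T2 under the simple linear model, equation (1)."""
    p_a_after_r1 = theta                        # A reinforced once, starting from zero
    p_c = (1 - theta) * 0.0 + theta             # Type 1: C reinforced once on R2
    p_a = (1 - theta) * p_a_after_r1 + theta    # Type 2: A reinforced a second time
    return p_c, p_a


def all_or_none_predictions(c):
    """P(C) and P(A) on T2 under the simple pattern model.

    Items are selected for an incorrect response on T1, which under the stated
    assumptions means the pattern was still unconditioned when R2 was given.
    """
    p_c = c  # Type 1: C becomes conditioned on R2 with probability c
    p_a = c  # Type 2: A becomes conditioned on R2 with probability c
    return p_c, p_a


if __name__ == "__main__":
    print(linear_predictions(0.4))       # illustrative parameter value
    print(all_or_none_predictions(0.4))  # illustrative parameter value
```

With any learning rate strictly between zero and one, the linear sketch gives P(C) below P(A), while the all-or-none sketch gives equal values.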


An Application of the All-or-None Pattern Model 


In this final section, we shall pursue further the possibility that the 
assumptions of the pattern model represent adequately the acquisition phase 
of associative learning. If so, then one might expect that in especially simpli- 
fied situations where the effects of retention loss are minimized, the model 
would provide a quantitative account of the data over the whole course of 
acquisition. The most important specifications are that the stimuli should 
be distinct, the full set of responses known to the subjects prior to the first 
test trial, and intervals between reinforced trials and test trials short enough 
to make forgetting negligible. It might be feared that the all-or-none pattern 
model, although appealing in its simplicity, will prove too meager in con- 
ceptual resources to deal with even such highly simplified experiments conduc- 
ted with real human subjects. This possible difficulty need not remain a 
matter of speculation, however, for during the last year or two, several 
relevant studies [1, 3, 5, 7] have been reported. 

An interesting example of the indicated type of application is provided 
by a series of studies Professor Patrick Suppes is currently conducting at 
Stanford University, with the assistance of Dr. Rose Ginsberg, concerning 
the learning of mathematical concepts by young children. Table 3 summarizes 


TABLE 3 
Stimuli and Correct Responses in Suppes and Ginsberg Study 

[Table body not recoverable from the source: six rows, each pairing a symbol combination (Stimuli) with its Correct Response.] 





the design of an experiment dealing with the learning of binary numbers. 
In order to permit analysis of the extent to which the learning involves the 
association of correct responses with abstract patterns, as opposed to com- 
ponent cues, Suppes and Ginsberg used a variety of particular symbols to 
represent each number. Thus the number four, 100 in binary notation, was 
represented by any one of the combinations shown in the first three rows of 
Table 3, and the number five, 101 in binary, by any one of the combinations 
shown in the last three rows. It can be seen that each symbol appears with 
both of the numbers, so learning cannot be based on association of responses 
with individual symbols. The items were presented in random order, under 
a correction procedure, to 24 five-to-six year old children for 96 trials. If 
the children make progress toward 100 percent accuracy in naming these 
sets of symbols, they must learn to associate the correct response either with 
each combination (pattern) as a unit or with the abstract property character- 
izing each set. 

One way of gaining information as to just what and how the children 
are learning over the series of trials is to entertain the 
hypothesis that the correct response becomes associated with each of the 
six stimulus patterns on an all-or-none basis and to apply the simple pattern 
model to the data. Specific assumptions involved in this application will 
be as follows. 


1. Prior to learning, the child responds to each stimulus by random 
choice of one of the permissible responses, with probability one-half 
for each. 

2. On each trial, the reinforced response becomes conditioned to the 
stimulus pattern present with probability $c$, the parameter $c$ being 
assumed constant over trials for each child. 

3. Once conditioned to a given stimulus pattern, the correct response 
henceforth occurs to that pattern with probability one. 
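
The three assumptions can be expressed as a small Monte Carlo sketch. Everything here beyond the assumptions themselves is illustrative: the function names, the seed, the number of simulated items, and the convention that the response on each trial is made before that trial's reinforcement takes effect. The parameter values in the usage lines ($c = .088$, 16 trials per item, two responses) are taken from the surrounding text.

```python
import random


def simulate_item(c, n_trials, n_responses, rng):
    """Simulate one item; return a list of 0/1 error indicators, one per trial."""
    conditioned = False
    errors = []
    for _ in range(n_trials):
        if conditioned:
            errors.append(0)  # assumption 3: conditioned pattern is always answered correctly
        else:
            guess_correct = rng.random() < 1.0 / n_responses  # assumption 1: random choice
            errors.append(0 if guess_correct else 1)
        # assumption 2: reinforcement conditions an unconditioned pattern with probability c
        if not conditioned and rng.random() < c:
            conditioned = True
    return errors


def mean_errors(c, n_trials, n_responses, n_items, seed=0):
    """Average number of errors per item over n_items simulated items."""
    rng = random.Random(seed)
    total = sum(sum(simulate_item(c, n_trials, n_responses, rng)) for _ in range(n_items))
    return total / n_items


if __name__ == "__main__":
    print(mean_errors(c=0.088, n_trials=16, n_responses=2, n_items=20000))
```

A simulation of this kind is only a check on reasoning; the closed-form expressions referred to in the text make it unnecessary for the statistics actually reported.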


Using these assumptions, Bower [1] has derived numerous expressions for 
statistics of the data of such an experiment (i.e., for a paired associate experi- 
ment with N distinct stimuli and N or fewer reinforced responses). In Table 4 


TABLE 4 

Statistics of Suppes and Ginsberg Study 
Compared with Values Predicted from “Pattern Model” 

Statistic                                              Observed    Theoretical 

Mean errors per item 
S.D. of errors per item 
Mean errors before first success 
S.D. of errors before first success 
Mean number of error runs 
Mean number of alternations of success and failure 
Autocorrelation of errors, lag 1 
Autocorrelation of errors, lag 2 

[Only one column of numerical entries is legible in this copy: 3.13, .92, 1.30, 2.44, 4.25, 1.94, 1.71, apparently belonging to the second through eighth statistics; the remaining entries are not recoverable.] 

a number of these statistics for the data of the Suppes and Ginsberg 
experiment are compared with the values computed for the theoretical expressions.* 
The first statistic, mean errors per item, was used to estimate the value of 
the parameter $c$ for the group of children; then all remaining theoretical 
values were calculated with this value of $c$ ($c = .088$) inserted in the 
appropriate formulas. Thus for the last seven lines of Table 4, no degrees of 
freedom from the data were used for “curve-fitting,” and the theoretical 
quantities can be regarded as a priori predictions. On the whole, one can 
see that very little of the variance in the observed data remains unaccounted 
for by the model. Granted that a more elaborate theory will be required to 
handle some aspects of paired associate learning, e.g., the generalization 
phenomena treated by Shepard [13], it would appear that any such theory 
must reduce to the present model or something very like it in numerous 
special cases. 

*I am indebted to Drs. Suppes and Ginsberg for making available these unpublished 
results of their study. 



Appendix


Following are the formulas used to calculate theoretical expressions for
the various statistics of the Suppes and Ginsberg study. As in the preceding
sections, c represents the conditioning parameter for any individual subject.
It will be convenient to define a random variable x_n which equals one or
zero according as an error or a correct response, respectively, occurs to a
given item on trial n. Then for an experiment consisting of N trials, with r
alternative responses, we have the following.


Mean errors per item:

\[
E\Bigl(\sum_{n=1}^{N} x_n\Bigr) = \Bigl(1 - \frac{1}{r}\Bigr)\frac{1 - (1 - c)^{N}}{c}.
\]


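Variance of errors per item:

\[
\operatorname{Var}\Bigl(\sum_{n=1}^{N} x_n\Bigr)
= \Bigl(1 - \frac{1}{r}\Bigr)\frac{1 - (1 - c)^{N}}{c}
- \Bigl(1 - \frac{1}{r}\Bigr)^{2}\frac{\bigl[1 - (1 - c)^{N}\bigr]^{2}}{c^{2}}
+ 2\Bigl(1 - \frac{1}{r}\Bigr)^{2}\frac{(1 - c) - N(1 - c)^{N} + (N - 1)(1 - c)^{N+1}}{c^{2}}.
\]

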
Mean errors to first success per item (F):

\[
E(F) = \sum_{n=1}^{\infty} \Bigl(1 - \frac{1}{r}\Bigr)^{n}(1 - c)^{n-1} = \frac{r - 1}{1 + c(r - 1)}.
\]


Variance of errors to first success:

\[
\operatorname{Var}(F) = E(F)\bigl[1 + (1 - 2c)E(F)\bigr],
\]

the probability, very small under the conditions of the Suppes and Ginsberg
study, of no success in the whole series of N = 16 trials being neglected in
these expressions for E(F) and Var (F).


Mean number of error runs per item (R_N):

\[
E(R_N) = u_1\Bigl[c + \frac{1 - c}{r}\bigl(1 - (1 - c)^{N-1}\bigr)\Bigr],
\]

where u_1 = (1/c) (1 - 1/r) is the total expected number of errors during
learning.


Mean number of alternations of success and failure per item (A_N):

\[
E(A_N) = E\Bigl\{\sum_{n=1}^{N-1}\bigl[x_n(1 - x_{n+1}) + (1 - x_n)x_{n+1}\bigr]\Bigr\}
= u_1\Bigl[c + \frac{2(1 - c)}{r}\Bigr]\bigl[1 - (1 - c)^{N-1}\bigr].
\]


Autocorrelation of errors, lag 1:

\[
E\Bigl(\sum_{n=1}^{N-1} x_n x_{n+1}\Bigr) = u_1\Bigl(1 - \frac{1}{r}\Bigr)(1 - c)\bigl[1 - (1 - c)^{N-1}\bigr].
\]


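Autocorrelation of errors, lag 2:

\[
E\Bigl(\sum_{n=1}^{N-2} x_n x_{n+2}\Bigr) = u_1\Bigl(1 - \frac{1}{r}\Bigr)(1 - c)^{2}\bigl[1 - (1 - c)^{N-2}\bigr].
\]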
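
The expressions of this appendix can be checked against a direct simulation of the model. The following is a minimal sketch in Python, not part of the original analysis. It assumes that an unconditioned item is guessed correctly with probability 1/r and becomes conditioned with probability c after each reinforced trial; the function names are hypothetical, and r = 2 is an assumption (only c = .088 and N = 16 are given in the text).

```python
import random


def expected_errors(c, r, n_trials):
    """Mean errors per item over n_trials trials."""
    return (1 - 1 / r) * (1 - (1 - c) ** n_trials) / c


def expected_errors_before_first_success(c, r):
    """E(F), neglecting the small chance of no success in the series."""
    return (r - 1) / (1 + c * (r - 1))


def expected_error_runs(c, r, n_trials):
    """E(R_N) = u1 * [c + (1 - c)/r * (1 - (1 - c)**(N - 1))]."""
    u1 = (1 - 1 / r) / c
    return u1 * (c + (1 - c) / r * (1 - (1 - c) ** (n_trials - 1)))


def expected_lagged_products(c, r, n_trials, lag):
    """Expected sum of x_n * x_(n + lag) over the series."""
    u1 = (1 - 1 / r) / c
    return u1 * (1 - 1 / r) * (1 - c) ** lag * (1 - (1 - c) ** (n_trials - lag))


def simulate_item(c, r, n_trials, rng):
    """Return the error indicators x_1..x_N (1 = error) for one item."""
    conditioned = False
    errors = []
    for _ in range(n_trials):
        if conditioned:
            errors.append(0)  # conditioned items are always answered correctly
        else:
            errors.append(0 if rng.random() < 1 / r else 1)  # guess among r responses
        # Each reinforced trial conditions a still unconditioned item with probability c.
        if not conditioned and rng.random() < c:
            conditioned = True
    return errors


def count_error_runs(errors):
    """Number of maximal runs of consecutive errors."""
    return sum(1 for i, x in enumerate(errors)
               if x == 1 and (i == 0 or errors[i - 1] == 0))


def simulate_means(c, r, n_trials, n_items, seed=0):
    """Monte Carlo means of total errors, error runs, and lag-1 products."""
    rng = random.Random(seed)
    total_errors = total_runs = total_lag1 = 0
    for _ in range(n_items):
        x = simulate_item(c, r, n_trials, rng)
        total_errors += sum(x)
        total_runs += count_error_runs(x)
        total_lag1 += sum(a * b for a, b in zip(x, x[1:]))
    return total_errors / n_items, total_runs / n_items, total_lag1 / n_items


C, R, N = 0.088, 2, 16  # c and N from the text; r = 2 is an assumption
simulated = simulate_means(C, R, N, n_items=20000)
```

The simulated means can be compared with `expected_errors(C, R, N)`, `expected_error_runs(C, R, N)`, and `expected_lagged_products(C, R, N, 1)`; agreement within sampling error is all that such a check can show.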


COMPUTER MODELS OF COGNITIVE PROCESSES


Twenty-five years ago, when the Psychometric Society was formed, the
new developments that were then shaping the course of psychometric research 
were subjective scaling and factor analysis. Today, a development that seems 
destined to influence psychometric research in the next twenty-five years is 
the simulation of cognitive processes on high-speed digital computers. So 
far, of course, there has been more talk about the potential utility of com- 
puters in psychology than there has been accomplishment; nevertheless, what 
achievements have been made seem to me to imply a bright future. 

Computers have had three main influences on psychometrics. First, the 
immense computational facility of computers has changed our attitude toward 
complex statistical techniques. In place of the helpless sense of frustration 
that used to arise when we learned that a new technique required, say, the 
inverse of a 20 × 20 matrix, or the complete set of latent roots and vectors
of a matrix, we now feel confident that a digital computer can do the job 
in a few minutes. With a computer, for example, we can think seriously about 
estimating the communalities of a large correlation matrix by the multiple 
correlation of each variable with all the other variables. We may think a 
long time before we do it, but nevertheless we can think seriously about 
such a prospect. 
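
To make the computation concrete: the squared multiple correlation of variable i with all the other variables can be obtained from the i-th diagonal element of the inverse correlation matrix, as 1 minus the reciprocal of that element. The following is a minimal sketch in Python using only the standard library. The function names and the 3 × 3 example matrix are hypothetical, and the use of the squared multiple correlation as the communality estimate is the usual convention, assumed here.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    # The right-hand half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]


def squared_multiple_correlations(corr):
    """Communality estimates: 1 - 1 / (i-th diagonal element of the inverse)."""
    inverse = invert(corr)
    return [1.0 - 1.0 / inverse[i][i] for i in range(len(corr))]


# Hypothetical correlation matrix for three variables.
example = [[1.0, 0.5, 0.3],
           [0.5, 1.0, 0.4],
           [0.3, 0.4, 1.0]]
smc = squared_multiple_correlations(example)
```

For a large matrix the same two steps apply; only the cost of the inversion grows, which is precisely the burden the computer removes.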

In the same way that computers have broadened our concept of what 
computations are feasible, they are also broadening our notion of what 
constitutes an acceptable solution to a problem. We used to think that an
analytic method of rotation in factor analysis, for example, meant a single 
equation or a few equations that gave the entire solution. Any more com- 
plicated process in which tests were selected and planes were put through 
subsets of tests, with a lot of contingent rules depending on what happened, 
did not seem really analytic. The difficulty is that we have been using the 
word analytic when we really mean objective. A procedure can be completely 
objective, no matter how complex, no matter how contingent its rules. If 
the procedure can be programmed for a digital computer, then it is com- 
pletely objective, even though it may not be analytic in the usual sense. 
The emphasis on objectivity, as defined by the existence of a computer program,
gives us a much freer rein in devising solutions for problems.

*Operated with support from the United States Army, Navy, and Air Force.

In the same way that a quantitative procedure can be defined in terms
of a computer program rather than an equation, objective models of behavior 
can also be computer programs rather than equations. If a model can be 
written down in an objective way, so that it can be programmed for a digital 
computer, then it can be tested. 

The notion that a digital computer program could be a model of human 
behavior is a natural outgrowth of a field of engineering endeavor that has 
come to be called artificial intelligence. Workers in this field are striving 
to develop more powerful machines, primarily by the automation of human 
functions. Automatic visual perceivers, automatic speech recognizers, and 
many kinds of learning and game-playing machines have been built or pro- 
grammed. Most of these automata make no pretense of achieving their 
results in the same way as their human counterparts. A small group of 
people has approached the computer simulation of cognitive processes with 
the specific intent of producing models of behavior rather than just getting 
the appropriate outcome. Newell and Simon have pioneered in this work, 
and Miller, Galanter, and Pribram have written a very stimulating essay 
on the possibility of applying this idea to all areas of psychology [7]. But 
any automaton, whether it is intended to simulate human behavior or just 
to do man-like things, is by definition a model of behavior. If a machine 
accomplishes the same result that a person does, then the machine is mani- 
festly a model of human behavior, as Boring [1] pointed out many years ago. 
In this paper no distinction is made between the simulations that are models 
by design and those that are models by accident. 

Most of the simulations that have been made can be put in one of four 
categories: neural nets, pattern recognizers, problem solvers, and language
processors. Significant contributions are being made in all these areas. 





Neural Networks 


Most of the existing simulations of neural networks have used random 
interconnections of elements, together with devices for reinforcing some con- 
nections and inhibiting others. The individual neurons were invented many 
years ago by Rashevsky [9], McCulloch and Pitts [6], and others, but the 
computers allowed the construction of huge networks of elements. The 
results of computer simulation of random nets can be summarized easily: 
random nets need very many elements—even then they learn very little 
very slowly. A random net will eventually learn a one-bit discrimination, 
but for more complicated performance, some nonrandomness is needed. In 
fact, the less randomness the better. A celebrated example of the futility 
of randomness is the Perceptron [10], a device that was widely publicized 
as a gadget that had solved the problem of machine pattern recognition. In 
fact, the Perceptron was a random neural net together with an input mosaic. 
The neurons were randomly connected with the cells of the input mosaic, 
and the outputs of these neurons were randomly connected to two response
units, since only a binary decision was desired. 

The results are not extensive, but in general the Perceptron did poorly. 
True, it learned a one-bit discrimination, distinguishing X’s from E’s, in
30 to 50 trials, but these results were obtained with a restricted set of inputs, 
more or less centered on the mosaic and more or less the same size. By in- 
serting a preprocessor that centers the figures exactly on the mosaic and 
sets them at exactly the same size, better results can be obtained. Structure 
can also be added and performance improved by connecting the neurons to 
the sensory mosaic in a more regular fashion. In general, the greater the 
departure from randomness, the better the performance. 

It is important to note that the random neural nets require hundreds 
of neural elements to make only a one- or two-bit recognition. The random 
nets do not even appear to be capable of discriminating among the 10 numer- 
als, let alone the 26 letters of the alphabet. It is also important to note that 
my disparaging remarks concern randomness, not nets in general. No one 
can deny that the human nervous system is a network of neurons. What we 
do deny is that the interconnections are random. 


Pattern Recognition 


Pattern recognizers have been much more successful than nervous nets. 
Of course, the Perceptron has been billed as a pattern recognizer, but it is 
not in a class with the pattern recognition devices that process their inputs. 
Pattern recognizers are designed to function in a context where the set of 
alternatives is known—the letters of the alphabet, the phonemes of American 
speech—and the machine’s job is to categorize each particular input as one 
of the known alternatives. 

The simplest pattern recognizers merely compare the raw input with a 
stored ideal version of each alternative—the closest match wins. For example, 
a machine for reading printed numerals compares the particular input with 
a stencil of each numeral—the most closely fitting stencil indicates the 
choice. One problem in such a machine is that some of the stencils are very 
similar to others. Devices for reading printed numbers from checks sur- 
mount this trouble by creating artificial differences in size and shape between 
the numerals to accentuate the differences. 
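
The stencil scheme just described can be sketched in a few lines. The following Python fragment is an illustration only: the 3 × 3 stencils, the character names, and the mismatch count used as the measure of fit are all hypothetical choices, not a description of any actual reading machine.

```python
# Hypothetical stored "ideal" versions of three characters on a 3 x 3 mosaic.
STENCILS = {
    "I": ("010",
          "010",
          "010"),
    "L": ("100",
          "100",
          "111"),
    "T": ("111",
          "010",
          "010"),
}


def mismatches(grid, stencil):
    """Count the cells in which the raw input disagrees with a stencil."""
    return sum(a != b
               for grid_row, stencil_row in zip(grid, stencil)
               for a, b in zip(grid_row, stencil_row))


def recognize(grid, stencils=STENCILS):
    """Return the name of the most closely fitting stencil."""
    return min(stencils, key=lambda name: mismatches(grid, stencils[name]))


# A "T" with one spurious cell filled in the bottom row.
noisy_t = ("111",
           "010",
           "011")
guess = recognize(noisy_t)
```

The difficulty mentioned above shows up directly in the mismatch counts: the "I" and "T" stencils differ in only two cells, so two cells of noise are enough to make the two alternatives indistinguishable.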

If there is no opportunity to manipulate the environment, stencils may 
overlap too much. Moreover, stencils for raw inputs become almost useless 
for recognizing human inputs, such as handwritten characters or speech, 
because of the huge individual differences. There is no ideal handlettered A, 
or ideal spoken “AH” that will correspond closely with the products of different
people, or even the same person on different occasions. Some analysis
of the input is required. 

The strategy that has worked out best in analyzing the inputs has been 
to choose a set of variables, such as the number of straight lines in the case of
handlettered characters, and to characterize both the inputs and the alter- 
native output categories in terms of these variables, which we may call 
tests. No one test is expected to be sufficient, but the combination of tests 
should be sufficient. Doyle [2] has used this technique in his computer program 
for recognizing sloppy hand-printed characters. The data for analysis were 
obtained by asking visitors to the Lincoln Laboratory to print their names and 
addresses in a series of }-inch squares. The computer quantized each square 
in a 32 × 32 array, and first filled isolated holes and erased isolated dots in
the characters. Then it applied a series of 29 tests to each character. For 
example, it determined the maximum number of lines crossed by a vertical 
slice through the character, the number of cavities facing left, etc. The 
scoring procedure was developed from a preanalysis of half of the collected 
samples of letters; the machine’s performance was tested on the other half. 
The scoring key was made by simply tabulating the outcome of all tests on 
the known characters in the pretest sample. To test the program, a new 
sample was fed to the computer, and the tests were made. The outcome of 
a test was compared with the tabulation, to find how likely each alternative 
was, given that particular outcome. The outcomes of all tests were combined 
in a simple way to give a set of weights proportional to the likelihood of 
each character, and the most likely character became the machine’s guess. 
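
The tabulate-and-combine procedure can be sketched as follows. This is an illustration under stated assumptions, not Doyle's program: the two tests, the tiny 3 × 3 characters, and the function names are hypothetical, and since the text says only that outcomes were combined "in a simple way," the rule used here (a product of smoothed relative frequencies) is an assumption.

```python
import math
from collections import Counter, defaultdict


def build_scoring_key(samples, tests):
    """Tabulate how often each test outcome occurs for each known character."""
    key = [defaultdict(Counter) for _ in tests]
    totals = Counter()
    for label, grid in samples:
        totals[label] += 1
        for table, test in zip(key, tests):
            table[label][test(grid)] += 1
    return key, totals


def classify(grid, tests, key, totals, smoothing=1.0):
    """Return the most likely character and a weight for every alternative."""
    log_weights = dict.fromkeys(totals, 0.0)
    for table, test in zip(key, tests):
        outcome = test(grid)
        seen = {o for counts in table.values() for o in counts} | {outcome}
        for label, n in totals.items():
            # Smoothed relative frequency of this outcome among the known samples.
            p = (table[label][outcome] + smoothing) / (n + smoothing * len(seen))
            log_weights[label] += math.log(p)
    best = max(log_weights, key=log_weights.get)
    return best, {label: math.exp(w) for label, w in log_weights.items()}


def max_vertical_crossings(grid):
    """Largest number of separate strokes met by any vertical slice."""
    best = 0
    for column in zip(*grid):
        runs = sum(1 for i, cell in enumerate(column)
                   if cell == "1" and (i == 0 or column[i - 1] == "0"))
        best = max(best, runs)
    return best


def top_row_ink(grid):
    """Number of filled cells in the top row."""
    return grid[0].count("1")


TESTS = [max_vertical_crossings, top_row_ink]
PRETEST_SAMPLE = [
    ("T", ("111", "010", "010")), ("T", ("111", "010", "000")),
    ("L", ("100", "100", "111")), ("L", ("100", "100", "110")),
    ("C", ("111", "100", "111")), ("C", ("110", "100", "110")),
]
key, totals = build_scoring_key(PRETEST_SAMPLE, TESTS)
guess, weights = classify(("111", "100", "110"), TESTS, key, totals)
```

Neither test settles the matter alone: the top-row test favors "T," and only the combination with the vertical-crossings test brings "C" out ahead.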

Pattern recognition has much in common with psychometric testing. 
In fact the problem of pattern recognition is formally identical with the 
placement problem. The same statistical techniques, including multiple dis- 
criminant functions, are applicable. The most important factor in each 
problem is the selection of useful items or tests. In each case, no single item 
is very effective, while the combined information from many items can be 
very effective. 

Pattern recognition differs from psychometrics in having a more reliable, 
objective criterion. The goal of automatic pattern recognizers is to be at 
least as good as human recognizers. Doyle’s machine correctly recognized
about 88 percent of the subset of 10 of the 26 letters, while humans correctly 
recognized about 95 percent of the stimuli that he used. 

Forgie and Forgie [3] used essentially the same procedure, but added 
some sequential decision processes, in an automatic speech recognizer. Work- 
ing with 12 vowel sounds, the Forgie machine recognized about 90 percent 
of the sounds correctly, the sounds having been spoken by 21 different 
speakers. Human listeners correctly recognized 97 percent of the same sounds. 


Problem Solvers 


Problem solving seems a more difficult task for computers than pattern 
recognition. Of course, all computers can solve problems when given very 
explicit instructions. The question is whether they can solve problems without
such direction. One scheme that people keep pondering is random behavior.
Friedberg [4] has shown conclusively that a computer cannot solve
problems at random. He gave a machine the task of programming a very 
simple computer by choosing instructions at random. The simple computer 
was to produce a single output by operating on a set of two or three inputs, 
a task not very different from adding one and one to get two. After 10,000 
trials, the machine had still not written a proper program. Various attempts 
to modify the random selection scheme, while still leaving it essentially 
random, mostly failed, though in a few cases the machine managed to get 
a program after a few hundred trials. 

The two nonrandom problem solving machines that deserve our atten- 
tion are both theorem provers. Newell, Shaw, and Simon [8] programmed 
their Logic Theorist to prove theorems in mathematical logic; Gelernter 
and his associates have programmed a machine to prove theorems in plane 
geometry. In both cases, the machine has at its disposal a set of known 
theorems and axioms from which to build a logical proof. Both machines 
work backward, from the conclusion to the premises, using a means-end 
analysis. That is, the machine examines the conclusion, and determines one 
or more propositions such that if it could prove the propositions, it could 
reach the final conclusion. Each such proposition becomes a subgoal, and 
the machine then tries to reach one of the subgoals, generating further sub- 
goals, until it finally gets back to a sub-sub-goal that it can prove directly 
from the premises. The main problem with such a machine is to keep it 
from going back too many blind alleys. A machine that examines all possible 
subgoals exhaustively would be exceedingly inefficient. The machine must 
be provided with some rules for picking profitable subgoals. Such rules can- 
not, in general, be any more than rules of thumb—that is, rules that work 
often but not always. Newell and Simon call such rules heuristics. A major 
heuristic in the geometry machine, for example, might be, if triangles exist, 
look for congruent triangles. The major problem in a theorem-proving machine 
is to choose good heuristics. 
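
The backward search through subgoals, ordered by a rule of thumb, can be sketched briefly. The Python fragment below is not the Logic Theorist or the geometry machine; it is a plain backward-chaining search with a depth limit, and the propositions, the rules, and the heuristic are hypothetical examples chosen to echo the congruent-triangles rule mentioned above.

```python
def prove(goal, facts, rules, heuristic, depth=5, trace=None):
    """Work backward from goal; return True if it can be reached from facts.

    rules is a list of (conclusion, premises) pairs. Candidate subgoal sets
    are tried in the order given by heuristic (lowest score first), and the
    depth limit keeps the search from running down endless blind alleys.
    """
    if trace is None:
        trace = []
    if goal in facts:
        return True
    if depth == 0:
        return False
    candidates = [premises for conclusion, premises in rules if conclusion == goal]
    for premises in sorted(candidates, key=heuristic):
        trace.append((goal, premises))  # record each subgoal set that is tried
        if all(prove(p, facts, rules, heuristic, depth - 1, trace) for p in premises):
            return True
    return False


def prefer_congruent_triangles(premises):
    """Rule of thumb: try subgoals that mention congruent triangles first."""
    return 0 if any("congruent" in p for p in premises) else 1


RULES = [
    ("AB = CD", ("ABCD is a parallelogram",)),
    ("AB = CD", ("triangle ABE congruent to triangle CDE",)),
    ("ABCD is a parallelogram", ("AB parallel to CD", "AD parallel to BC")),
    ("triangle ABE congruent to triangle CDE",
     ("AE = CE", "BE = DE", "angle AEB = angle CED")),
]
FACTS = {"AE = CE", "BE = DE", "angle AEB = angle CED"}

trace = []
proved = prove("AB = CD", FACTS, RULES, prefer_congruent_triangles, trace=trace)
```

With this heuristic the search reaches the premises without ever trying the parallelogram subgoal; a different heuristic applied to the same rules would explore the subgoals in a different order, which is the sense in which the same basic machine behaves differently.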

The same basic machine, given different heuristics, will behave differently. 
Newell and Simon used essentially this fact in simulating the performance 
of novices proving logic theorems. From the performance of the subject, 
they inferred his particular heuristics, which they then gave to the machine, 
to see how closely the machine would then match the subject’s performance. 
The data are too limited to make any definite conclusion, but the close 
comparison of man and machine in the few cases that have been tried is
very encouraging.


Language Interpretation


The remarkable human ability to produce and react to natural language 
is perhaps best appreciated by those who are trying to provide computers 
with the same ability. Mechanical translation from one language to another
is a highly prized goal that is still out of reach, despite popular accounts in 
the newspapers. Passable translations can be made with a simple dictionary 
look-up, plus a collection of routines to choose among alternative defini- 
tions. Many choices, however, are exceedingly difficult to mechanize, and are 
left for a human editor. Such a machine is little more than a giant dictionary. 

Apart from language translations, many efforts are underway in infor- 
mation retrieval and automatic abstracting, the goal being to facilitate 
access to accumulated scientific information. I am currently at work, with 
Alice Wolf, Carol Chomsky, and Ken Laughery, on a program to enable a 
computer to answer questions posed in natural English about data stored in 
the computer. We are working in the context of baseball scores, and are 
trying to get the computer to answer such questions as “How often did the
Red Sox beat the Yankees in July?” Or, for that matter, “Did the Red Sox
beat the Yankees in July?”

None of this work on automatic language processing is at a stage where 
it can provide models of the human communication process. Nevertheless, 
it is important to psychologists for two reasons. First it demonstrates how 
little we know about the details of human verbal processes. Second, it pro- 
vides our best chance to make a model of such processes. A computer program- 
med to respond to natural English is an excellent operational test for a model. 


Conclusion 


From the small sample of achievements that I have had time to mention, 
we can only conclude that automation is here to stay. Nor is there any doubt 
that more powerful automata will be built. A great many of the “higher” 
human abilities will be given to machines. The great rush to automation 
is sure to stimulate psychologists to learn more about the human symbolic 
processes being mimicked by the machines. And the computers, which are 
the ultimate cause of the feverish scramble toward automation, are providing 
both the framework for describing complex models of behavior and also the 
means for testing these models. With both the means and the motivation at 
hand, psychologists are sure to make rapid progress in understanding complex 
human behavior. 


About 27 years ago, a small group of students met with Professor Thurstone
in Chicago to discuss methods of encouraging quantitative work in
psychology. The initial group that was concerned about the slow rate of 
development of quantitative work in psychology included Jack Dunlap, Al 
Kurtz, Marion Richardson, John Stalnaker, G. Frederic Kuder, and Paul 
Horst. They had discussed the problem, had been helped a bit by Donald 
Paterson, and had decided that possibly if a magazine were set up to publish 
quantitative psychological material this would facilitate the development of 
the field. Persons who did good quantitative work, either theoretical or 
experimental, would thus have a forum where it would be accepted because 
it was high quality quantitative work, rather than being rejected because it 
was quantitative and hence “not of too great interest” to the readers. 

It developed after discussion that possibly the best method of supporting 
such a journal would be to have a society which would have this journal as 
its major organ. This was the nucleus of the Psychometric Society and of 
the magazine Psychometrika, a quarterly journal devoted to the development 
of psychology as a quantitative rational science. 

Thus, in March of 1936, Volume 1, Number 1 of Psychometrika was 
issued with Marion Richardson as Managing Editor, and Horst and Thurstone 
as members of the editorial board. From this small beginning with five or 
ten people interested in furthering the development of the field, it is interesting 
to look back now and consider what has happened during the intervening 
25 years. 

Let us look at the state of quantitative rational psychology at that 
time. Thurstone’s work over the preceding ten years, from 1925 to 1935, 
might well be thought of as typifying the field then. He had done some work 
in the area of learning (Thurstone [44, 46]), developing certain learning 
curves and checking on the fit of these curves to learning data. He had also
considered some of the typical material in psychophysics, had become some- 
what dissatisfied with the emphasis in psychophysics on measuring brightness 
of lights or heaviness of weights, had thought that it would be tremendously 
more fruitful and interesting to measure the strength of an attitude, the 
beauty of a picture, the degree of preference for a belief, for a nationality, 
or for a political candidate. This was the genesis of Thurstone’s psycho- 
physics—the Law of Comparative Judgment set up to analyze data collected 
by the experimental method of paired comparisons. Later Thurstone initiated 
what Torgerson has termed the Law of Categorical Judgment to deal with 
the data collected by the experimental method of successive intervals. Suc- 
cessive intervals was developed for the situation in which one could not 
reasonably require that the subject make all intervals equal (method of 
“equal-appearing intervals”) or where there was doubt that he could or
would do so, even if requested. At this time also, Thurstone [45] had completed 
his beginning text on test theory, a photo-offset version, and had started 
his developments of factor analysis for the further study of mental abilities. 
Thus he had worked in the various areas which today represent the major 
areas in which the quantitative rational approach in psychology has achieved 
the most success. 

It is of interest that Professor Boring [5] in a recent discussion of quanti- 
tative developments in psychology specified four areas that had been particu- 
larly fruitful for such developments. These were psychophysics, learning, 
mental measurements, and reaction time. Thurstone’s work between 1925 
and 1935, as indicated above, dealt with three of these four areas. 

During the subsequent 25 years there has been relatively little quan- 
titative development in the study of reaction time. There has, however, 
been a tremendous growth in psychophysics or psychological scaling, in 
learning, and in mental measurements represented by developments in test 
theory and in factor analysis. As to the work in psychophysics or psychological 
scaling, I shall simply refer to the symposium held this morning as an illu- 
stration of the development in this field over the last 25 years, and will 
consider here in some detail Learning, Test Theory, and Factor Analysis. 

In order to set the stage for the discussion here I should like to illustrate 
one view of the relationship between scientific theory, mathematics and 
statistics (Gulliksen [18]). One always, of course, initially has the psycho- 
logically meaningful verbal statements of the postulates, the basic assumptions 
of any system. The characteristic thing about the mathematical rational 
approach is that at a very early stage these postulates, that is, the functioning 
postulates that would have some impact on deducing the nature of experimental
results, are translated into the language of mathematics. We then
have the stage of mathematical development of the concepts eventuating
in various equations some of which contain two or more terms that can be
subject to experimental observation. These then may be termed the observation
equations for which one can gather data. One then designs an experiment
and collects data from the experiment and then (with statistics) checks on
the degree of agreement between the observation equation and the data.
Frequently when one speaks of quantitative methods in psychology, he is
thinking only of the use of statistics to check on the agreement between a
hypothesis and data.

Figure 1
Mathematical Formulation of Psychological Theories

In this discussion I will not deal with statistics, which is essentially
the last step in the development. I will discuss the complex indicated by the 
verbal psychological statements of the postulates, the mathematical state- 
ments of these same postulates, and the derivations from which one gets 
various implications of the initial postulates eventuating then in mathe- 
matical equations that could be in agreement with data from experiments 
or that could be in disagreement with data. 

Statistics (the estimation procedures, testing of hypotheses, and the 
determination of confidence intervals) is a field that has undergone such 
tremendous developments in the last 25 years that again it could not possibly 
be covered even in a symposium devoted entirely to statistics.

Omitting both Psychophysics and Statistics is reminiscent of Sherlock
Holmes in “The Adventure of Silver Blaze.” When asked for the most
significant item in the case to date, he said, “The strange behavior of the
dog in the nighttime.” Watson, after thinking a moment, replied, “But the
dog did nothing in the nighttime.” “That,” said Holmes, “is the strange
behavior.”

In the consideration of developments in the last 25 years, in a single 
symposium, it is necessarily true that the most significant items in develop- 
ment are those that are being omitted because they are too extensive to deal 
with short of several symposia. Areas essentially nonexistent 25 years ago
are now too extensive to be considered in a single session. 


Learning 


One area in which there has been considerable development of mathe- 
matical translation of verbal postulates and derivation of their consequences 
is the area of learning (Hilgard [21]). Thurstone [44, 46] in the early 30's 
developed a theory based on an analogy of sampling from an urn, and showed 
that the equations derived from such assumptions were in reasonable agree- 
ment with data. Since then there have been a number of learning theories 
stated in mathematical form. Gulliksen [16, 17] has generalized Thurstone’s 
initial equations and developed others based directly on Thorndike’s law of 
effect showing that these equations are identical with those that Thurstone
developed in terms of an urn model. Rashevsky [35] has taken an approach 
from basic ideas of the functioning of the nervous system, utilizing inhibition 
and facilitation, and has developed some equations of learning on this basis. 

Hull [24] has utilized as his starting point the conditioning model where 
each repetition has an effect of increasing the strength of the response. He 
also used the concept of confusability of various responses to account for 
the lack of, shall we say, immediate learning to explain different degrees of 
difficulty of learning in serial lists. The probabilistic model that expresses 
its postulates in terms of operators increasing and decreasing the probabilities 
of response has also been developed during this time (Bush and Mosteller [7]). 
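
The operator idea can be illustrated with a simplified sketch. In the Python fragment below, each trial applies a linear operator to the current probability of the response: a rewarded trial moves the probability a fixed fraction of the way toward one, and a nonrewarded trial moves it a fixed fraction of the way toward zero. The limits of one and zero, the single rate parameter, and the function names are simplifying assumptions for illustration, not the general form of the model.

```python
def reward(p, rate):
    """Operator for a rewarded trial: move p a fraction `rate` toward 1."""
    return p + rate * (1.0 - p)


def nonreward(p, rate):
    """Operator for a nonrewarded trial: move p a fraction `rate` toward 0."""
    return (1.0 - rate) * p


def learning_curve(p0, rate, n_trials):
    """Response probabilities over n_trials consecutive rewarded trials."""
    curve = [p0]
    for _ in range(n_trials):
        curve.append(reward(curve[-1], rate))
    return curve


curve = learning_curve(p0=0.2, rate=0.1, n_trials=10)
```

Under these assumptions repeated reward gives the familiar negatively accelerated curve, with probability 1 - (1 - p0)(1 - rate)^n after n rewarded trials.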

I should also mention the work of Audley [3] in London. He has developed 
probabilistic equations of learning and devised methods of fitting these to 
individual learning curves so that one can obtain parameters for each indi- 
vidual from data on learning curves and also from data on changes in reaction 
time with learning. This is a rather interesting development, first, because 
it develops the probabilistic model so that parameters can be computed for 
each individual, and, second, because it relates the right-wrong response 
data to the reaction time data. One of the characteristics of learning is that 
the reaction time usually decreases. This theory tries to show that these 
two curves are two different manifestations of the same basic set of parameters. 
Roger Shepard [39] has related work in learning to psychophysics showing 
that generalization in learning is related to psychological similarity. 

There have also been some recent interesting attempts to develop these 
models of learning and to express them in terms of electronic computing 
machine programs where the machine is instructed to compute probabilities 
in accordance with the numbers in certain cells. Under reward conditions 
it adds something to the numbers in those cells, under punishment conditions 
it subtracts something. The information processing language (described by 
Green [14]) developed by Newell, Shaw, and Simon is an illustration of this 
particular approach (see also Newell and Simon [34]). Also Block, Rosenblatt, 
and others at Cornell have been working on the perceptron (see Rosenblatt
[37]). This is a mechanical gadget in which the initial connections are purely 
random. However, there is a programming of an increase and decrease in 
resistance of certain circuits corresponding to reward and punishment and 
it turns out that this machine with purely random connections is capable 
of learning. Other discussions of complex behavior of computers are found 
in the Western Joint Computer Conference proceedings [53], the Teddington 
National Physical Laboratory symposium [33], Hagensick [20], Shannon and 
McCarthy [38], and Uhr [51]. 

A very interesting thing to note as one surveys these various theories 
by Audley, Estes, Bush, Mosteller, Hull, Rashevsky, Thorndike, Gulliksen, 
and Thurstone is the essential similarity in the basic framework of each theory. 
This can be indicated as follows. 


1. There is some procedure to effect the “stamping in,” the “facilitation,”
or the “increase in probability” of a response that in some sense is a
correct response, a rewarded response, or a response that is at least
dominantly rewarded.

2. There is a corresponding postulate regarding the “stamping out,”
“inhibition,” or “decrease in probability” of a response that may be thought
of as a wrong response, an incorrect response, an unrewarded response, or
at least a dominantly nonrewarded response.

3. Many of the theories also have some provision regarding resemblance
or similarity of stimuli either in their sensory characteristics or in
their position, such as position near to each other in a rote learning series.
This sort of similarity or contiguity leads in certain contexts to confusion
and slows up learning; in other contexts it is termed “generalization of
response to similar stimuli,” or “transfer of training,” or “equivalence of
stimuli.” There is some mechanism, in other words, whereby a response which has
initially been learned to one stimulus tends to be given to other stimuli.
Depending on the particular learning set-up designed by the experimenter
this tendency may either delay learning in one situation, or facilitate
generalization in another situation.

4. There is also some sort of decrease in probability or fading out 
of a response, ‘forgetting’ due either to passage of time or due to confusion 
with other stimuli. In some guises it has been termed retroactive inhibition. 
Rashevsky has shown how a differential decline rate for inhibition and 
facilitation could produce a “reminiscence” effect. This decline with time 
again enters into a number of the different learning theories. 

5. There is also a change in reaction time that is often made a part 
of the theory. Hull utilized this as one of his postulates. One of the mani- 
festations of learning is a decrease in response latency. Audley [3] has also 
used this to give a very interesting possibility for a sort of reliability check 
on a single learning situation.

During the last 25 years we have had a reasonable proliferation of
slight variants on the increase or decrease of strengths and probabilities. 
These various sets of postulates result in somewhat different observation 
equations. However, the basic observation equations would all be in a super- 
ficial sense fairly similar so that it would probably take rather a precise test 
of a fit in various experiments to determine that one of these theories was 
a better fit to the data than others. Bush and his co-workers at Pennsylvania 
are embarking on such a program now. It is to be hoped that others will 
follow and that in the next 25 years we will be able to specify more accurately 
the kind of learning situation for which a given model or equation is most 
appropriate. 
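
The common skeleton of postulates 1 and 2 can be made concrete with a small simulation. The sketch below assumes a linear-operator update of the kind associated with Bush and his co-workers; it is only one of the variants alluded to above, and the function names, the rate parameters `theta_reward` and `theta_nonreward`, and the numerical values are illustrative assumptions rather than anything taken from the theories cited.

```python
def update_probability(p, rewarded, theta_reward=0.2, theta_nonreward=0.1):
    """One trial of a hypothetical linear-operator learning rule."""
    if rewarded:
        # "Stamping in": move the response probability toward 1.
        return p + theta_reward * (1.0 - p)
    # "Stamping out": move the response probability toward 0.
    return p - theta_nonreward * p


def learning_curve(p0, outcomes, **rates):
    """Response probability before each trial and after the last one."""
    curve = [p0]
    for rewarded in outcomes:
        curve.append(update_probability(curve[-1], rewarded, **rates))
    return curve


# Ten consecutive rewarded trials, starting from an even chance.
curve = learning_curve(0.5, [True] * 10)
```

Under this assumed rule the curve rises in a negatively accelerated fashion toward 1. Competing theories would replace `update_probability` with a different operator, which is why rather precise tests of fit are needed to tell them apart.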














Reliability of Learning Parameters 





I want to mention here a development that is not, strictly speaking,
quantitative, but one that may have a tremendous influence in the
quantitative development and testing of learning theory. This stems from the
work of Sperry [40]. He has found it possible to divide a brain into two halves 
by sectioning the corpus callosum and the optic chiasma; he reports that 
not only is it found that habits learned by one half do not transfer to the 
other half, but for a given animal the peculiarities manifested by him in
“right brain learning” are again exhibited in “left brain learning.” Should this
turn out to be verified, or generally true, we now have a possibility never before 
envisaged by workers in the field of learning, the “split brain reliability.” 

In my opinion one of the great handicaps under which work in learning 
has labored over the last hundred years has been the fact that unlike the 
mental test area, it has been essentially impossible to do a repeat experi- 
ment and to determine a reliability. Every respectable achievement or 
aptitude test has some device of odd-even, first and second half, or repeat 
test, whereby one attempts to do the same thing twice and measures the 
accuracy of the technique by the correlation between these two halves— 
the reliability coefficient. In the case of learning the experimenter could 
always obtain a learning curve to determine parameters. However, when 
he attempted to get another learning curve, there was always a dilemma. 
He could experiment on animals which had not been used for the first set 
of learning curves, in which case there was simply a sort of species reliability. 
It would be considered extremely poor procedure, in the case of an intelligence
test, to correlate one person’s score with another person’s score in order to
determine the test reliability. Or he could have the same subjects learn another
problem, in which case there was always the question, “Was the subject
learning the second problem better because of the influence of the first one,
or was he hindered in his learning of the second problem because of the
influence of the first one?” The experimenter could never be particularly
certain which was the case and, as a result, measures of learning have not 
had reliability coefficients attached. One just does not know the extent to
which the lack of agreement is a result of a difference in the psychological
function being tested, the psychological ability being tested, or simply the 
result of poor experimental techniques. Certainly this contribution of Sperry 
and others is worth an extremely careful look to see if the initial possibility 
that it holds for “split brain reliability” coefficients in the case of learning
tasks is really borne out. 
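
The odd-even device mentioned above for mental tests can be sketched in a few lines. The example assumes right-wrong item scores and uses the Spearman-Brown formula to step the half-test correlation up to full length; the formula is standard but is my addition here, and the function names and the score table are hypothetical. A “split brain” coefficient would presumably be computed in the same way, with the two halves of the test replaced by the two learning records of each animal.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half_reliability(item_scores):
    """Odd-even reliability; item_scores holds one list of items per person."""
    odd = [sum(row[0::2]) for row in item_scores]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)                 # Spearman-Brown step-up


# Hypothetical data: five persons, six items scored 1 (right) or 0 (wrong).
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
reliability = split_half_reliability(scores)
```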


Relation of Intelligence to Learning 


I should also like to emphasize that while one purpose of learning theories 
is, of course, to describe the course of learning, this in itself should not stand 
as a final goal. Important questions can be raised regarding the relationship 
of these learning parameters to other parameters characterizing behavior 
of the individual. I can illustrate this point with the studies by Stake [41] 
and Allison [1]. They both have raised a question regarding the relationship 
between mental abilities and learning. As we know, for decades intelligence 
has been defined as the ability to learn, yet intelligence tests have measured 
the ability to learn not directly but only by inference. They have concen- 
trated on what has already been learned. Both Stake and Allison have set 
up a variety of learning problems, have fitted equations of the learning curve 
to the data obtained from 200 or 300 persons who took these learning tests, 
have also given these people some 30 or 40 aptitude and achievement tests 
and then have entered the entire material into a factor study. The purpose 
of these studies is to determine how many different learning abilities there 
are, and to see how these learning abilities are related to the abilities measured 
by aptitude and achievement tests. 

First we can say that, as a result of these two studies, the learning area 
is definitely a complex area that cannot be represented in terms of one 
learning ability. There are many different kinds of learning ability—how 
many we will not know until a good many more studies have been made. 
Second, it is clear that some of the abilities required for the learning tasks 
are not represented in any of the intelligence measures. The nature and the 
importance of these abilities that have been missed by the one-shot aptitude 
and achievement measures constitutes a very important problem for further 
investigation. 

I should also indicate that studies such as Stake’s and Allison’s could 
not have been conducted without electronic computers. Stake estimated that
by Monroe-Marchant methods in use a few years ago his analysis would 
have taken one hundred and twelve man-years. With electronic computers 
the job was done in about six months. 


Master- or Reference-Learning Curves 


In the first volume of Psychometrika, Eckart and Young [11] published 
a very important paper. It dealt with the approximation of one matrix by
another of lower rank. It applied in general to any matrix, square or
rectangular, and furnished the essential basis for the use of matrix theory for
expressing and testing a large number of quite different psychological hypothe- 
ses. (See also Hohn [22] for an elementary treatment of matrices.) 

One interesting application of the Eckart-Young theorem is to learning
matrices. For many years, people have analyzed group learning data, plotted
group learning curves, and criticized others on the ground that there are
individual differences in learning which averages ignore. The Eckart-Young
procedure has been used by Tucker [48, 50] for analyzing learning data.
The matrix of trials by individuals is factored to give a minimum number of
“reference” or “master” learning curves. Each individual receives a set of
weights indicating the extent to which he has utilized each curve. If the 
matrix is rank one, then there is only one master learning curve, and the 
average curve is a good representation for each individual. In general, for 
ranks greater than one the individuals will not be correctly represented by 
the average curve. 
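
The rank-one case can be illustrated with a small computation. The sketch below finds the best single master curve for an individuals-by-trials matrix by power iteration, a stand-in for the full Eckart-Young analysis and not Tucker's actual procedure; the function names and all the numbers are hypothetical. When every individual's record is a multiple of one curve the misfit is essentially zero, and adding one differently shaped learner leaves a clear remainder.

```python
from math import sqrt


def leading_curve(matrix, iterations=100):
    """Return (curve, weights) for the best rank-one fit, by power iteration."""
    n_people, n_trials = len(matrix), len(matrix[0])
    curve = [1.0] * n_trials
    for _ in range(iterations):
        weights = [sum(row[t] * curve[t] for t in range(n_trials))
                   for row in matrix]
        curve = [sum(matrix[p][t] * weights[p] for p in range(n_people))
                 for t in range(n_trials)]
        norm = sqrt(sum(c * c for c in curve))
        curve = [c / norm for c in curve]
    weights = [sum(row[t] * curve[t] for t in range(n_trials))
               for row in matrix]
    return curve, weights


def misfit(matrix, curve, weights):
    """Sum of squared differences between the data and weight * curve."""
    return sum((matrix[p][t] - weights[p] * curve[t]) ** 2
               for p in range(len(matrix)) for t in range(len(curve)))


master = [0.2, 0.5, 0.7, 0.8]                  # hypothetical master curve
rank_one = [[w * m for m in master] for w in (1.0, 0.5, 1.2)]
late_learner = [0.0, 0.0, 0.1, 0.9]            # a differently shaped record
mixed = rank_one + [late_learner]

fit_one = misfit(rank_one, *leading_curve(rank_one))
fit_mixed = misfit(mixed, *leading_curve(mixed))
```

Here `fit_one` is zero apart from rounding, so one master curve suffices; `fit_mixed` is not, which is the signal that a second curve is required.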

Tucker [48, 50] has applied this method of handling learning matrices 
to some probability learning data collected by R. Allen Gardner. He finds 
that in a simple probability learning situation where the subject is distin- 
guishing between probabilities of .70 and .30, the matrix is of rank one. 
Only one learning curve is necessary to explain the data. In another situation, 
where four objects were presented with relative frequencies 70, 10, 10, and 
10 percent, three different learning curves were needed to explain the data. 
There were apparently (shall we say) early learners, medium learners, and 
people who caught on to some of the ideas very late in the series of trials, 
so that one of the learning curves was a rapidly rising negatively accelerated 
curve, and the other two were inflected S-shaped curves. The different 
subjects had different weighted combinations of these curves. 

Weitzman [52] has utilized the Eckart-Young procedure for analyzing 
matrices of learning data (animals by trial matrices) for a combined group 
of rats and a group of fish, putting them together as successive rows of the 
same matrix and applying a uniform analysis. The question is, “Will the
learning curves that are necessary for the rats be the same as those that are
exhibited by the fish, and will the weights of the learning curves needed for
the rats be the same as or different from the weights needed for the fish?”
In his particular case he found a rather clear-cut rank-two structure which 
means that the same two, shall we say, master learning curves were necessary 
to explain the learning data for the rats and for the fish. 


Test Theory 


The area of mental measurement, which in the 30’s was represented by 
Thurstone’s [45] small photo-offset manual, now covers a huge literature 
(Anastasi [2], Cronbach [9], Guilford [15], Thorndike and Hagen [43], Lindquist 
[27], Remmers and Gage [36], and Meehl [31]).

Reliability and error of measurement are no longer the simple concepts
they were 25 years ago (Cureton [10], Jackson and Ferguson [25]). Guttman 
[19] has developed formulas for lower bounds of reliability coefficients. 
Cronbach [8] has suggested many different kinds of reliability coefficients 
taking account of various types and combinations of factors which can affect 
test performance. Perhaps one generalization would be to point out that 
there are k different factors which may influence test performance such as 
fatigue, practice, additional learning, time of day, state of health, emotions, 
distractions, maturation, and growth. There are then 2^k different reliability
coefficients, depending on which particular set of factors is of interest for 
the particular use to be made of the test. The more important ones have 
been explicitly dealt with by Cronbach, Guttman, and others. 
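
The count follows from a simple argument: each of the k factors is either allowed to vary between the two measurements or held constant, so the number of combinations doubles with each added factor. A minimal sketch, using three of the factors named above and a hypothetical helper `factor_subsets`:

```python
from itertools import combinations

factors = ["fatigue", "practice", "time of day"]   # k = 3, for illustration


def factor_subsets(names):
    """Every set of factors that might be allowed to vary between testings."""
    subsets = []
    for size in range(len(names) + 1):
        subsets.extend(combinations(names, size))
    return subsets


subsets = factor_subsets(factors)   # one candidate coefficient per subset
```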

Error of measurement is no longer a single number to attach to a test 
to represent variance of observed test scores for persons with the same true 
score. The error of measurement is a function of true score, so that the 
discriminating power of the test will be different at different ability levels. 
Mollenkopf [32] initiated some work in this area. The problem is being 
studied in greater detail by Birnbaum [4] and Lord [30]. The goal of this 
work would be to develop procedures so that it would be possible to specify 
the discriminating power desired in various ability ranges, and then to 
construct a test having the desired characteristics. 
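
One simple way to see why the error of measurement depends on true score is a binomial error model, in which a person whose true proportion correct is p answers each of n items independently. This model is an assumption adopted here for illustration and is not necessarily the formulation used in the work cited; the function name is hypothetical.

```python
from math import sqrt


def binomial_sem(n_items, true_proportion):
    """Standard error of a number-right score under a binomial error model."""
    return sqrt(n_items * true_proportion * (1.0 - true_proportion))


# On a 100-item test the error is largest for middling true scores
# and shrinks toward the extremes.
sem_middle = binomial_sem(100, 0.5)
sem_high = binomial_sem(100, 0.9)
```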

The personnel classification problem is the problem of assigning or 
recommending the most efficient utilization of each person in a group to 
perform the set of jobs to be done by that group. Votaw, Brogden [6], and 
others have suggested solutions for the problem. 
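
Stated in its barest form, the classification problem is to choose the one-to-one assignment of persons to jobs that maximizes the total predicted performance. The sketch below simply tries every assignment, which is feasible only for very small groups and is not the method of Votaw or Brogden; the function name and the payoff table are hypothetical.

```python
from itertools import permutations


def best_assignment(payoff):
    """payoff[i][j]: predicted performance of person i on job j (square table)."""
    n = len(payoff)
    best_total, best_jobs = None, None
    for jobs in permutations(range(n)):
        total = sum(payoff[person][job] for person, job in enumerate(jobs))
        if best_total is None or total > best_total:
            best_total, best_jobs = total, jobs
    return best_jobs, best_total


payoff = [
    [9, 2, 7],
    [6, 4, 3],
    [5, 8, 1],
]
jobs, total = best_assignment(payoff)
```

In this table the first person is best at the first job, yet the best total places him in the third. Giving each person the job he does best is therefore not, in general, the most efficient utilization of the group.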

The central problem of test theory is the relation between the ability 
of the individual and his observed score on the test. A third concept, that of 
the true score of an individual on a test, has also been introduced in an effort 
to clarify the problem. Psychologists are essentially in the position of Plato’s 
dwellers in the cave. They can know ability levels only through the shadows 
(the observed test scores) cast on the wall at the back of the cave. The problem 
is how to make most effective use of these shadows (the observed test scores) 
in order to determine the nature of reality (ability) which we can know only 
through these shadows. Birnbaum [4], with his studies of test theory, and 
Lazarsfeld [26], with his use of various trace lines in latent structure analysis, 
have proposed various types of solutions to this problem. 

An attempt to develop a consistent theory tying test scores to the 
abilities measured is typified by Lord’s recent work [28], including his Psycho- 
metric Society presidential address [30], in which he formulated at least 
five different theories of the relationship between test scores and abilities, 
and showed how it was possible to test certain ones of these. It is to be hoped 
that during the next 10 or 20 years a number of these tests will be carried 
out so that we will have not five different theories of the relationship between 
ability and test score and various possible trace lines, but we will be able to
say that, for certain specified tests constructed in this way, here is the
relationship between the score and the ability measured, and this is the appropriate
trace line to use. 


Factor Analysis 


Another one of the major developments over the last 25 years has 
stemmed from the work in factor analysis of mental tests. It is interesting 
to note that when Thurstone worked for the military during the first world 
war, the contribution of psychologists under Dr. Yerkes was to set up a 
single measure of ability, the Army Alpha, or a measure of lower level ability, 
the Army Beta, and to range all men along the single scale of the Army 
Alpha test and on the strength of this information to assign jobs. 

I remember in teaching beginning psychology classes in the late 20’s 
that I repeatedly explained to doubting freshmen that it was merely a 
popular superstition that some people had high verbal ability and others 
had high mathematical ability. These various abilities were perhaps matters 
of differential interest, but basically there was only one intelligence as indi- 
cated by the Spearman so-called two-factor theory, which of course was 
one general factor with various sorts of specific factors, and that any belief 
in various factors had the status purely of an unverified popular superstition. 

In the early 30’s Thurstone took the view that very possibly we had 
failed to find different types of intelligence simply because we had not looked 
carefully enough with sufficiently powerful methods. He developed the factor 
methods, found that there was a mathematics—the mathematics of matrix 
theory—that was possibly relevant, and devoted his time to studying this 
and applying it in the analysis of mental abilities. I remember Thurstone 
telling that he had presented his factor problem to some of the mathema- 
ticians at a Quadrangle Club lunch one noon, pointing out that he had a 
square array of numbers here (the set of correlation coefficients), that he 
wanted to get one rectangular array such that when multiplied together in 
a certain way the sum products of the numbers in these two rectangular 
arrays would equal the correlations in the one larger square array. He said 
they smiled at each other and said, ‘‘Oh, the square root of a matrix is all 
that is.’’ He insisted on pursuing the inquiry further, found that there was 
a field that possibly dealt with this topic that he should be interested in, 
tutored in it for some years, and developed as a result the vectors of mind and 
multiple factor analysis. Tremendous numbers of studies stemmed from this 
work. Other theoretical developments in the area were made by Truman 
Kelley and Harold Hotelling, who also generalized Spearman’s one general- 
factor view to include the possibility of a large number of factors. This was 
the beginning of literally hundreds of factor studies which led to the develop- 
ment of a variety of tests of various mental abilities. One illustration of the 
impact of this work is the difference in the testing program in the second
world war. None of the services utilized only a single measure of general
intelligence. There were tests of a variety of abilities: verbal, quantitative,
spatial, mechanical. Placement for different types of assignments was
dependent on different weighted combinations of these abilities.
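
The problem Thurstone put to the mathematicians can be written R = FF', where R is the square table of correlations and F is a rectangular table of factor loadings with one column per factor. The sketch below assumes that communalities rather than unities stand in the diagonal of R, and it extracts the columns by a simple diagonal method, which is not Thurstone's centroid procedure. The function names and the loadings are hypothetical.

```python
from math import sqrt


def factor(r, n_factors):
    """Diagonal-method factoring: returns loadings F with F F' close to r."""
    n = len(r)
    residual = [row[:] for row in r]
    loadings = [[0.0] * n_factors for _ in range(n)]
    for k in range(n_factors):
        pivot = residual[k][k]
        if pivot <= 1e-12:          # nothing left to extract on this test
            continue
        column = [residual[i][k] / sqrt(pivot) for i in range(n)]
        for i in range(n):
            loadings[i][k] = column[i]
            for j in range(n):
                residual[i][j] -= column[i] * column[j]
    return loadings


def reproduce(loadings):
    """Sums of products of rows of the loading table, that is, F F'."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in loadings]
            for row_i in loadings]


true_loadings = [[0.8, 0.0], [0.7, 0.3], [0.2, 0.6], [0.1, 0.7]]  # hypothetical
r = reproduce(true_loadings)        # four tests, communalities on the diagonal
estimated = factor(r, n_factors=2)
largest_error = max(abs(a - b)
                    for row_a, row_b in zip(reproduce(estimated), r)
                    for a, b in zip(row_a, row_b))
```

Two columns reproduce this four-by-four table to within rounding. The solution is not unique, since any rotation of the columns of F gives the same products, which is where the rotation problem of multiple factor analysis enters.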


Theory of Factor Analysis 


With respect to the theoretical developments in factor analysis, we have 
had a considerable growth in the area of statistical tests for significance of 
factors or of ranks of matrices, although considerable still remains to be 
done in this area. The development of methods of comparing factor analyses 
results of one battery with those of another—the interbattery method— 
constitutes an extremely significant contribution (Tucker [49]). The other 
lack, until recently, was the lack of methods for comparing one study on 
a given set of tests with another study using the same set of tests on a different 
sample of people (Tucker [47]). So we now have precise methods for comparing 
different groups given the same battery, and different batteries given to the 
same group. These are powerful extensions of the factor method. 

The recent development of high-speed computing methods is also critical 
for this field. Twenty years ago there was a considerable argument between 
persons with a mathematical bent, such as Hotelling, who insisted that one 
must use the principal axis solution, and experimenters, such as Thurstone, 
who maintained that, while the principal axis solution was very nice, he had 
never seen anyone utilize it with 50 tests on 200 or 300 people. We now of 
course have computing routines that give the principal axis solution at a 
feasible time and cost so that this controversy is now technologically obsolete. 
Thurstone would clearly have adopted the principal axis solution as soon as it 
was feasible from the point of view of cost involved and time consumed. 

Many of the problems in test theory and factor analysis are essentially 
problems of multivariate analysis in mathematical statistics. It is very en- 
couraging to note that many psychologists are developing proficiency in 
mathematical statistics, and also that mathematical statisticians, such as 
T. W. Anderson, Frederick Mosteller, David Votaw, Allan Birnbaum, D. N. 
Lawley, M. G. Kendall, S. S. Wilks, John Tukey, and others, are becoming
interested in some of the statistical problems associated with test theory and 
other branches of psychology, and are providing the psychologists with 
solutions to these problems. 


Applications of Factor Analysis 


There have been various conferences on factor analysis and its results 
lately. Two monographs by French [12, 13] on the various achievement and 
aptitude factors and the various personality factors indicate the degree to 
which this field has proliferated. The need now seems to be for more
systematization, boiling down, determining which of the factors are important and
which are not, rather than added proliferation of the factors.

Typically, the work in factor analysis has dealt with a battery of pre- 
dictors. However, increasing attention is being directed toward the problem 
of using a battery for efficient prediction, differential prediction, of multiple 
criteria. The Psychological Corporation has a differential prediction battery. 
Horst [23] at the University of Washington has been developing the theory 
for differential prediction, and developing such a battery. 


Achievement Tests 


I probably should also mention that the field of achievement testing 
has developed considerably since the early 1900’s, when the three-hour essay
examination graded by crews of readers was the standard procedure for
the College Entrance Examination Board. There is some appreciation of the 
fact that evaluation of the essay is not very precise, and that teachers need 
to be taught the appropriate methods for preparing and evaluating classroom 
tests. This is an extremely large job on which only a relatively small start 
has been made as of now. In the next 25 years I would hope for considerably 
greater sophistication of the classroom teacher in the development and 
evaluation of tests than we find now. 


Summary 


We have considered developments over the last 25 years in the area of 
measurement of mental abilities. Marked advances have been made in 
determining the relationship between the ability measured and the test score, 
in methods of item analysis, in the differentiation and classification of various 
methods of dealing with reliability. The big development in this area though 
has been the change from the emphasis on a single general intelligence to 
the differentiation of a large number of different aptitudes. This has been 
made possible by the development of the factor analysis methods. 

Note that factor methods were just at their beginning when Psychometrika 
was started, that the initial paper by Young and Householder on multi- 
dimensional scaling techniques had not yet been written, the Eckart-Young 
paper dealing with the expression of one matrix as a product of two other 
matrices of minimum rank, a fundamental factor analysis theorem, had not 
yet been written, and that the factor computations were done entirely with 
Monroe-Marchant methods. We can see that during the last 25 years there 
has been, first, a terrific growth in the basic theory related to mathematical 
formulation of psychological problems—basic theory in the area of testing, in 
the area of aptitude measurement and factor analysis, in the area of learning, 
and in the area of psychophysics. Second, there has been a tremendous 
development of computational methods, enabling us to do studies now that
were essentially impossible because of time and cost factors even five or
ten years ago. 

The findings resulting from these methods have an impact in various 
areas. The development of multiple factor tests has changed the entire 
picture of the testing field from what it was during the first world war. 
The development of a variety of learning theories gives some promise that 
in the next 25 years we will be able to specify the types of conditions, if 
any, under which these various theoretical approaches are appropriate. 

The development of the unidimensional and multidimensional scaling 
methods and their use in a variety of areas, in measuring sensations, in 
measuring preferences or values for objects, should have considerable impact. 
Various fields such as linguistics, sociology, and economics should benefit 
tremendously from some of these methods that have been developed during 
the last 25 years since this small group of students met with Thurstone and 
decided to form the Psychometric Society to publish Psychometrika, and to 
further the development of psychology as a quantitative rational science.

When writing this paper as well as during our commemorative luncheon,
I could not help going back in memory to two experiences. One of these
occurred during the fall semester of 1935 while I was visiting professor 
at Northwestern University and took advantage of the opportunity and 
had the rare privilege of attending Thurstone’s seminar. The atmosphere 
at the University of Chicago at that time clearly reflected the birth and early 
infancy of both the Psychometric Society and its Journal Psychometrika. 

The other experience comes back in the form of an image of our first 
annual dinner in the Inn at Dartmouth College in September 1936. The 
banquet had an unexpectedly large and enthusiastic attendance. Thurstone 
provided one of his memorable addresses, which was later published in 
Science, entitled “Psychology as a quantitative rational science” [35]. The
address is well worth rereading. 

I might spend the time allotted to me in declaiming the merits of
quantitative psychology and what it has done for the progress of psychology in
general. But in this group I shall take our substantial contribution for granted 
and try instead to help measurement psychologists look at themselves and 
their own progress and some of their own problems. 

Briefly, my comments are based upon a time plan. Taking a very quick 
look at things as they were in psychological measurement 25 years ago, I 
shall take you on a U-2 survey, from very high altitude, of developments 
since that time. I shall pause on some of the livelier issues currently of in- 
terest, view with alarm some of the less pretty aspects of quantitative psy- 
chology, and finally say in what directions future developments might be 
pressed. 

Psychological measurement of 1935 was very much centered around 
one man. It is no wonder that L. L. Thurstone should dominate the scene 
at the initiation of our Society and of its Journal. He had just previously 
developed new psychophysical theory, putting new life into a very old subject 
and broadening its horizons considerably. He had developed his form of 
attitude scale, giving social psychologists new instruments of investigation. 
He had developed methods of absolute scaling of test items and in many 
other ways had put testing on a more completely rational basis. He was in
the midst of developing his multiple-factor theory and his centroid method
with rotations, both of which have become standard procedure in most places 
where factor analysis is done. 


Trends Since 1935 


With this very inadequate description of initial status, let us attempt 
to see what has happened to our field in the intervening years. A general 
picture can be obtained by considering the articles that have appeared on 
quantitative subjects over the years. I have made no attempt to run down 
all articles pertaining to measurement. The lazy man’s way out was taken 
in the form of examination of three journals all of which specialize in quanti- 
tative articles—Psychometrika, Educational and Psychological Measurement, 
and The British Journal of Statistical Psychology. 


Trends indicated by ‘Psychometrika’ articles 


No single journal could give us a better picture of periodical publication 
on quantitative psychology than Psychometrika. There is, of course, a special 
interest in knowing about the contents of the journal whose birthday we 
celebrate. Those who were close to the founding of Psychometrika and who 
perhaps had a hand in shaping the statement of its objective, are naturally 
interested in knowing how well subsequent history has lived up to that 
objective. That objective was stated as follows (it bears repeating): a journal 
devoted to the development of psychology as a quantitative rational science. 

I have gone through the first 24 volumes of Psychometrika, the 25th 
being only partially complete, categorizing the articles and determining what 
proportion of the total numbers published in each five-year period (or pentad), 
also in all volumes taken together, is devoted to each kind of article. I do 
not claim to have the best possible set of categories nor do I pretend that the 
articles are classified with complete accuracy. The results are nevertheless 
informative. 

Figure 1 shows graphically the categories and the proportions of totals 
that were devoted to each topic. A list of over-all proportions is also shown. 
I shall mention the topics in descending order of their proportions, but you 
will find them in ascending order in Figure 1. 

Articles on factor analysis have taken up about 30 percent of the space, 
in terms of number of articles—almost a third of the total space. A sub- 
division of these articles as between articles on theory and methods and 
those on factor-analytical results shows that the great majority have been 
on theory and methods. The total proportion devoted to reporting results 
of factor analyses has dwindled to almost nothing in the last four years, in 
accordance with a current policy of Psychometrika not to publish articles of 
that nature. 

Space devoted to test theory and methods increased from a relatively
small beginning and has maintained its large share of space during the last
twenty years. Included under test methods are treatments of the
personnel-classification problem. Many articles have been on reliability.


Figure 1


Proportions of the articles in Psychometrika devoted to various general areas during its
first 24 years.

Articles on general statistical methods have shown a marked increase 
then a decrease over the years, with maximum publication in the third 
pentad. The latter fact may be attributed to the enlarged recognition of 
analysis of variance and related methods. 

Under mathematical models are included theoretical mathematical treat- 
ment of problems in several fields: learning (3.7 percent), social behavior 
(3.5 percent), neurology (2.6 percent), and a group of articles that pertain 
to what is sometimes called finite mathematics, including information theory, 
decision theory, and theory of games. The articles on mathematical models 
started with a relatively large number, mostly from Rashevsky and his 
students, on the subjects of biophysics and social behavior. The proportion 
dwindled to a low of 3.4 percent in the third pentad. Growth since that time 
can be attributed largely to applications of finite mathematics. 

The mathematical-model group seems to represent most clearly the kind 
of developments fitting the objective of Psychometrika. We should not
overlook the fact, however, that much of the publication on factor theory and
methods and on test theory and methods also represents mathematical-model
building. We should also not overlook the fact that more such articles are
appearing in the Psychological Review and other journals than formerly.

Psychophysics and scaling articles have been mostly on scaling methods 
and theory. The relative spurt during the past four years has been devoted 
largely to the methods of successive categories and pair comparisons. There 
is thus a current lively interest in scaling subjects. It should also be mentioned 
that such articles appear scattered in other journals. 

The “other-subjects” category is the inevitable miscellaneous group.
It includes articles of general psychological theory and also many of the 
presidential addresses, which are often unclassifiable, although they of course 
pertain to psychological measurement in some way. 

Everything considered, the data presented in Fig. 1 indicate that Psycho- 
metrika has lived up to its stated objective quite well. Its founders and its 
Editorial Board can feel much satisfaction in this regard. 


Educational and Psychological Measurement 


The stated objectives of EPM emphasize tests, their development and 
use, but also mention measurement methods in general. Starting five years 
later than Psychometrika, Educational and Psychological Measurement now 
extends over almost four pentads. In classifying articles appearing in this 
journal, I have omitted those in the supplements, since they obviously repre- 
sent a biased sampling, on the subjects of counseling and guidance. A rather 
different set of categories is needed as compared with those applying to 
Psychometrika. The results are shown in Figure 2.* 

About a third of the total space has been devoted to general statistical 
methods, test statistics and test development, and other measurement 
methods, including scaling methods. About half of the space in Psychometrika 
has been devoted to those subjects. The space devoted to those subjects 
in EPM has been growing decisively, from about 15 percent in the first 
pentad to more than 40 percent in the last two pentads. 

Approximately 8 percent of EPM’s space has been devoted to uses of 
tests and testing programs, primarily in education and in the civil services. 
About 19 percent has been devoted to particular tests, their development, 
their reliability and validity, and experimental work with them. The space 
devoted to test uses and test programs has dwindled progressively to a 
negligible quantity during the last pentad. 

About 9 percent of the space of EPM has been devoted to articles on 
counseling and guidance. This proportion has been rapidly decreasing during 
the past ten years, but real trends in this instance are somewhat obscured 
by the fact that many such contributions probably went into the supplements. 


*Mann [24] has made a more detailed classification of most of the same articles. 


Figure 2 

Proportions of the articles in Educational and Psychological Measurement devoted to various 
general areas during its first 19 years. 

The space devoted to factor analysis, mostly on results, has always 
been small, amounting to 7 percent. The notable increase during the past 
four years reflects the decision of Psychometrika not to publish such articles 
and a consequent shifting from the one journal to the other. Educational 
and Psychological Measurement, too, seems to have served its objectives well, 
and it has been a good supplement to Psychometrika, the one journal empha- 
sizing theory and the other practice. 


The British Journal of Statistical Psychology 


The British Journal of Statistical Psychology should be mentioned as a 
counterpart of Psychometrika, which it most resembles. Since the British 
journal has been in existence a little more than ten years, no study of trends 
is meaningful. Classifying its articles (omitting one-page notes and letters) 
in the same categories as were used with Psychometrika, we find the following 
percentages. 





Factor analysis 
Test theory and methods 
General statistical methods 
Mathematical models 
Psychophysics and scaling 


The “other” category includes a relatively large proportion of the articles, 
many of which are on particular tests and on problems of heredity, subjects 
that are very rare in Psychometrika. 

The British journal devotes a relatively larger proportion of its space 
to factor analysis and to the subjects just mentioned. It devotes relatively smaller 
proportions to the subjects of general statistical methods, psychophysics 
and scaling, and mathematical models. Taken altogether, the three journals 
I have mentioned are alike in devoting the majority of their spaces to tests 
and factor analysis. 


Some Active Issues and Developments 


Let us take a closer look at some of the more lively areas of investigation 
and controversy that are in the forefront at the present time. A hundred 
years after Fechner, Stevens [30] has brought to the fore the question of 
psychophysical laws. Related to this issue are the obvious discrepancies in 
scaling results as obtained from different principles of scaling—scales as 
derived from comparative vs. category vs. ratio judgments. A possible 
reconciling principle, the concept of adaptation level, introduced by Helson 
[15] and developed in collaboration with Michels [26], can surely no longer 
be ignored by writers on scaling issues and psychophysical laws. Finally, 
developments in information theory, decision theory, utility theory, and 
game theory also have bearing upon scaling methods and psychophysical 
laws, as well as a number of other psychological problems. 


Psychophysical Laws 

No quantitative psychological law has generated as much controversy 
and has had such a long life as Fechner’s psychophysical law. A hundred 
years later, Luce and Edwards [23] give mathematical reasons for believing 
that Fechner’s law was improperly derived from the Weber law. They object 
to his use of a differential equation, not a new kind of objection, by any means. 
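
The step to which objection is taken can be set down in three lines. What follows is only a sketch of the classical derivation in its usual textbook form, not a reproduction of the argument of Luce and Edwards; the constants k and c and the limen \phi_0 are symbols assumed here for illustration.

```latex
\begin{align}
\frac{\Delta\phi}{\phi} &= k && \text{(Weber's law: the jnd is a constant fraction of } \phi\text{)} \\
d\psi &= c\,\frac{d\phi}{\phi} && \text{(Fechner's step: jnd's treated as equal differentials)} \\
\psi &= c \log\frac{\phi}{\phi_0} && \text{(integrating, with } \psi = 0 \text{ at the limen } \phi_0\text{)}
\end{align}
```

The objection concerns the passage from the first line, a statement about finite increments, to the second, a statement about differentials.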

The operational increment Δψ (ψ for psychological) that is associated 
with an increment Δφ (φ for physical) is usually defined as the jnd, where 
two stimuli are correctly discriminated 75 percent of the time, two judgments 
being permitted. It would seem, intuitively, at least, that one could let the 
standard of the increment Δψ decrease with standards approaching 50 per- 
cent as a limit but making Δψ as small as one pleases, Δφ decreasing ac- 
cordingly, thus supporting the use of Fechner’s differential equation. The 
equality of the psychological increments, under any probability standard, 
however, would rest on the assumption of Thurstone’s case V (equal dis- 
criminal dispersions and zero intercorrelations) or Mosteller’s [29] case Va 
(equal dispersions and equal correlations). 

Stevens [32] rejects Fechner’s law as being “erroneous” because jnd’s 
are not equal. Stevens [30] has proposed that for what he calls prothetic 
psychological scales, “the” law is ψ = bφⁿ, where ψ and φ are the psychological 
and physical quantities, respectively, b is a scaling factor, and the exponent 
n depends upon the kind of variables involved. Prothetic scales involve 
quantitative judgments, pertaining to such variables as loudness and bright- 
ness, in short, intensive dimensions. Stevens has marshalled a great amount 
of evidence to support this choice of law and others have provided additional 
supporting data.* In fact, the power psychophysical law relating ψ to φ 
was first proposed in recent years by me and Dingman [14]. We shall come 
back to the power law after considering problems of scaling. 
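
Because the power law is linear in logarithms, log ψ = log b + n log φ, the exponent n can be estimated as the slope of a straight line fitted to log-log data. The following is a minimal sketch under that assumption only; the function name `fit_power_law` and the noise-free illustrative values are hypothetical and are not taken from any of the studies cited.

```python
import math

def fit_power_law(phi, psi):
    """Least-squares fit of log(psi) = log(b) + n*log(phi); returns (b, n)."""
    xs = [math.log(v) for v in phi]
    ys = [math.log(v) for v in psi]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    n = sxy / sxx                 # slope in log-log coordinates = exponent
    b = math.exp(my - n * mx)     # intercept in log-log coordinates = log b
    return b, n

# Hypothetical noise-free magnitudes generated from psi = 2 * phi**0.6
phi = [1.0, 2.0, 4.0, 8.0, 16.0]
psi = [2.0 * v ** 0.6 for v in phi]
b, n = fit_power_law(phi, psi)
print(round(b, 3), round(n, 3))
```

With real judgments the points scatter about the line, and the fitted n will depend upon the conditions of judgment discussed in the next section.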


Psychological Scaling 


It has been common knowledge among those concerned with psycho- 
logical-scaling methods that when different kinds of judgments are applied 
to the same stimuli, the resulting scales are likely to be related to one another 
in nonlinear ways. The type of psychophysical function that fits the data 
depends upon the kind of judgment involved—comparative, categorical, or 
ratio. Ratio judgments give scales that fit the power law, and there is often 
a positive acceleration of ψ as a function of φ. Category judgments commonly 
give scales that fit a logarithmic function. Jnd scales, formed by cumulating 
jnd increments of ψ as a function of φ, commonly show more negative accele- 
ration than do category scales. 

Not all of these scales can be “correct.” All of them could, of course, 
be incorrect. Stevens [30, 33] adopts the results from ratio scaling as being 
correct. His best support for this preference comes from impressive experi- 
mental results showing that two ratio-scaled modalities bear linear relation- 
ship to one another, the slope of which is the ratio of the two n exponents 
for those two modalities. Garner [8], however, almost as vigorously rejects 
ratio scales in favor of jnd scales. 
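
The cross-modality relationship can be spelled out in two lines. The sketch assumes that each modality obeys its own power law and that the observer equates the two sensations; the subscripts 1 and 2 and the constants b_1 and b_2 are labels introduced here for illustration. The linearity holds in logarithmic coordinates.

```latex
\begin{align}
\psi_1 = b_1\phi_1^{\,n_1}, \qquad \psi_2 &= b_2\phi_2^{\,n_2} \\
\psi_1 = \psi_2 \;\Longrightarrow\; \log\phi_1 &= \frac{n_2}{n_1}\log\phi_2 + \frac{1}{n_1}\log\frac{b_2}{b_1}
\end{align}
```

The slope n_2/n_1 is thus fixed in advance by the two separately determined exponents, which is what gives the matching experiments their force.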

What is one to believe? Are we to be left with a purely operational view, 
accepting different psychophysical laws for different scaling methods? Is 
there any way to bridge the gaps between the different kinds of scales? In 
the face of discrepancies like this, it is not so much a question of which is 
correct, but of how we can reconcile the differences. Until we do, we have 
not learned as much about the phenomena as we should know. A more 
comprehensive theory is needed. Certain developments are promising in this 
regard. The adaptation-level theory of Helson and Michels has some strong 
unifying potentialities. In fact, a few years ago, Michels and Helson [27] 
demonstrated how the results from ratio scaling can be harmonized with 
their own version of Fechner’s law. 


*Studies from the University of Stockholm indicate that in order to make the power 
function fit data one must commonly substitute the quantity (φ − φ₀), where φ₀ is de- 
termined from the data [7]. Stevens [33] reports one such instance. 

Recently, Luce [21] has demonstrated mathematically that if ψ and φ 
are both measured on ratio scales the only possible kind of psychophysical 
law is a power function, and if ψ is measured on a category scale the only 
possible kind of law is logarithmic. It would seem that if we could only 
locate a genuine zero point on a category scale, we would thereby have a 
ratio scale, and the power law should then apply. It would seem, however, 
that it would require more than a shift of zero point to go from a logarithmic 
function of φ to a power function of φ, especially when the exponent is 
greater than unity. 

I have often felt that if either comparative judgment scaling or cate- 
gory judgment scaling were done taking into account changes in discriminal 
dispersions, the same psychophysical law might apply in all three kinds of 
scaling [11]. This would require the assumption of something other than 
case V of either Thurstone’s law of comparative judgment or Torgerson’s 
law of categorical judgment [36]. As a matter of fact, Björkman [3] has very 
recently demonstrated that where discriminal dispersions change syste- 
matically with stimulus level, we have a basis for going mathematically 
from one kind of scale to another. 

It should be clear by 1960 that scaling results are a function of a great 
many conditions. Even with the same kind of judgments and the same 
kind of stimuli, the empirical psychophysical function will vary depending 
upon the circumstances. Among these circumstances are the general level 
of the experimental stimuli on their stimulus continuum, their range and 
their distribution, the presence of simultaneous stimuli, of immediately 
preceding stimuli, and even of more remotely preceding stimuli, the preceding 
judgment or judgments, and the individual observer. 

Ratio-scaling methods are by no means exempt from such effects. Baker 
and Dudek [2] show a considerable range of the exponent n in the power 
function for scales of stimulus weights from different investigators, subjects, 
and judging conditions. McGill [25] reports considerable variation from person 
to person, where scales are constructed from judgments of single observers. 
He suggests that such differences may arise from the fact that each observer 
has his own conception of numbers and quantities and how to match the two. 
Torgerson [37] makes a similar suggestion. He points out that an observer 
may take two different attitudes toward quantity. An increment in φ may 
be taken either as a linear change or a proportional change. Thus, ratio 
judgments may be to some extent contaminated by interval estimates, and 
interval judgments may be similarly contaminated by ratio estimates. 
Similarly, there are probably confoundings of comparative and category 
judgments. These problems have as yet not been specifically investigated. 

There are two special results from ratio scaling that are worthy of notice. 
Warren and Warren [39] found that if stimulus weights are varied roughly 
in size as they vary in weight, to control the size-weight illusion, the exponent 
n in the power function is essentially unity. A weight half as heavy as another 
physically feels half as heavy. Baker and Dudek [2] cite an instance in which 
size was completely controlled as a cue, with a resulting n also essentially 
unity. 

Such results have led Warren [38] to propose a “physical-correlate” 
theory of ratio judgments. He maintains that ratio judgments are not actually 
made with respect to ψ quantities but rather with respect to φ quantities, 
as we have learned to associate sensation with perceived physical values. 
The old-timers among you will recognize Warren’s “physical-correlates” 
phenomenon as a case of Titchener’s “stimulus error,” a case in which the 
error is complete. 

Is the ratio method peculiarly susceptible to the stimulus error? Is it 
possible that the departure of the exponent n from unity in each case is an 
indication of the extent to which the observer lacks experience with the 
particular kind of stimuli? Stevens [33] gives a table of values of n for a 
number of different modalities. It is interesting that their mean proves to 
be 1.0004, if two questionable values are omitted. One case omitted is es- 
sentially a repetition of another, the case of loudness judgments, and the 
other is for electric shock. The exponent of 3.5 for electric shock stands out 
by itself, the reason for which may be that more than a sensory continuum 
may be involved, including pain and anxiety. 

At any rate, an average exponent of unity might be interpreted as 
supporting Warren’s hypothesis. On the other hand, the exponents (excluding 
the one for electric shock) range from 0.3 to 1.7, a fact that any hypothesis 
like Warren’s would have to explain. Certainly, there is insufficient support 
for his curious conclusion that the measurement of psychological quantities, 
as such, is impossible. It does not seem reasonable that in all modalities 
such as odors, taste qualities, and vibrations, the observer has had sufficient 
experience in associating sensations with physical calibrations to build up 
dependable associations for judgment purposes. Also against the theory are 
results of experiments [5] showing that ratio scales can be derived for stimuli 
for which we know no physical equivalents, e.g., roughness of sandpaper 
and preferences for neckties. 

At this point, some consideration of the basic nature of psychophysical 
judgments may help. The judgment given by the observer is a verbal reaction. 
He reports equal to, greater than, more distant than, half as great as, and 
so on. Such reactions are not to be confused with the observer’s perceptions 
or feelings. It is this difference that led the writer to point out a logical 
distinction between two psychological continua, the response continuum 
(perceptual) and the judgment continuum (non-perceptual) [10]. Most psy- 
chophysical theory has been in terms of relating quantities (R) on the response 
continuum to quantities (S) on the stimulus continuum. Operationally, we 
have related quantities (J) on the judgment continuum (or values derived 
from judgments) to quantities on the stimulus continuum. Only in the case 
of linear correspondence of judgment and response quantities will our experi- 
mental results yield fully correct conclusions regarding response quantities. 

Incidentally, the psychophysical judgment phenomenon can be related 
to concepts in the writer’s “structure of intellect” [12]. In that model, dis- 
tinctions are made among three kinds of content or three varieties of infor- 
mation: figural (perceived), symbolic (numbers or letters), and semantic 
(verbal). In a psychophysical experiment, stimuli first produce items of 
figural information. These items have real positions on one or more response 
continua, depending upon the number of dimensions of experience involved. 
We are usually interested in one of these dimensions. In accordance with the 
experimenter’s instructions, the observer’s task is to make a translation of 
figural into semantic content; he gives a verbal response. The various psy- 
chophysical scaling methods are designed to make the further translation 
into symbolic content, to assign numbers. Sometimes the observer is asked 
to make the translation himself from figural into symbolic content, as in 
rating on a numerical scale or in giving a ratio judgment in numerical form. 
His success will depend upon his conception of numbers and their properties 
and upon how to associate them with psychological quantities. 


Adaptation Level 


Earlier, I spoke of the great amount of relativity there is in connection 
with psychophysical judgments and consequently with scaling. Helson’s 
concept of adaptation level is the only theoretical position that attempts 
to take into account the many sources of relativity. There is insufficient 
space here to give an adequate discussion of this important theory and its 
implications. A few comments will have to suffice. 

At the moment of stimulation, the observer is said to have a preestab- 
lished “neutral” point on the psychological continuum concerned. With a 
certain set of stimuli of a certain modality used in an experiment, one certain 
stimulus quantity is most likely to elicit a judgment of “medium.” This 
stimulus is at the adaptation level. If the stimuli are weights, the judgment 
at the adaptation level A is neither “heavy” nor “light”; if they are sounds 
to be judged for pitch, judgments at level A are neither “high” nor “low.” 
The adaptation-level stimulus is conceived quantitatively as a weighted 
geometric mean of all such experimental stimuli, all such background stimuli 
(contiguous in space or time), and all such previously experienced stimuli, 
each set with its appropriate weight. 
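
The weighted geometric mean just described can be illustrated numerically. The sketch below assumes, for simplicity, a single background value and a single residual value standing for past experience, with three weights that sum to one; the function name `adaptation_level`, the weights, and the stimulus values are all hypothetical and are not Helson’s published constants.

```python
import math

def adaptation_level(series, background, residual,
                     w_series, w_background, w_residual):
    """Weighted geometric mean of three stimulus classes (weights sum to 1)."""
    total = w_series + w_background + w_residual
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Pool the experimental series into its own geometric mean first.
    mean_log_series = sum(math.log(x) for x in series) / len(series)
    log_a = (w_series * mean_log_series
             + w_background * math.log(background)
             + w_residual * math.log(residual))
    return math.exp(log_a)

# Hypothetical lifted weights in grams, judged with and without a heavy anchor.
weights_g = [200.0, 250.0, 300.0, 350.0, 400.0]
a_no_anchor = adaptation_level(weights_g, 300.0, 300.0, 0.6, 0.2, 0.2)
a_heavy_anchor = adaptation_level(weights_g, 900.0, 300.0, 0.6, 0.2, 0.2)
print(round(a_no_anchor, 1), round(a_heavy_anchor, 1))
```

Under these assumptions the heavy anchor raises A, so that a stimulus formerly judged “medium” would now fall below the neutral point.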

One feature of the adaptation-level concept, which is often overlooked, 
is the assumption of bipolarity of the judgment scale. Stimuli have 
positive or negative properties according as they are above or below A. 
The adaptation level is a most important reference point for the whole scale. 
A consequence is that A replaces the stimulus limen (S₀) as an important 
landmark. For example, Fechner’s law has traditionally been applied using 
the stimulus limen as the unit of the physical scale. Michels and Helson [26] 
have arrived at a new logarithmic law, using the adaptation level as the 
unit of the physical scale. 

Helson and others have generalized the adaptation-level concept con- 
siderably, applying it to motivation, learning, and social behavior [16]. 
Stevens [31] has objected strenuously to the application of a concept that 
was developed in connection with judgments of sensory phenomena to other 
areas of behavior. Stevens wants to restrict the term adaptation to sensory 
phenomena. He suggests that in dealing with non-sensory data we speak 
instead of judgmental relativity. 

I see no reason why a psychologist may not use a technical term such 
as adaptation as broadly as he cares to define it. Certainly there is much 
basis for recognizing that similar principles apply to both sensory and non- 
sensory aspects of behavior. Principles of very broad application indicate 
fundamental coherence in a science, and comprehensive envisagement of 
the science. 

But in fairness to Stevens, I should say that he has put his finger on a 
problem of theory that Helson and Michels have overlooked. They have 
overlooked it because they could do so without harm to their general concept 
and the accompanying principles. The two views can be easily reconciled if 
both Stevens and Helson recognize explicitly the distinction between the 
judgment continuum and the response continuum. The recognition is implicit 
in Stevens’ suggestion of judgmental relativity. 

But he should also recognize that there is a perceptual relativity. That is, 
for the same stimulus value there are systematic changes in response quan- 
tities, in addition to the chance fluctuations postulated in Thurstone’s concept 
of discriminal dispersion. The same weight does not feel as heavy one time 
as another, and this is not always a sampling fluctuation. In other words, 
there is an actual shift in Rᵢ in response to Sᵢ. There is also a relativity of 
judgment quantity in conjunction with the same Rᵢ value, the relativity that 
Stevens recognizes. That is, the observer readjusts his set of semantic labels 
to meet changing circumstances. Tones all called “high” in one experiment 
may all be called “low” in another. The observer may well recognize that 
he has made such a shift. 
In any one experiment, where judgmental effects are produced consequent 
to a change of adaptation level, it would be difficult to say how much of these 
effects are due to a systematic shift in R and how much of the effects are 
due to a systematic shift in J. Helson and Michels do not face this problem 
because they developed their theory in terms of the judgment continuum, 
ignoring the response continuum. For a complete theory it would seem desira- 
ble to bring the response continuum into the picture. 


Finite Mathematical Models 


The models utilizing finite mathematics have in common the fact that 
probabilities are somehow involved. The data are at least in part categorical. 
Such models include stochastic theories of learning, information theory, 
decision theory, utility theory, and game theory. 

I cannot begin to do justice to stochastic learning theory here. I would 
comment only that in spite of their elegance such theories will probably 
have relatively limited application. There is, first, the fact that when we 
get beyond the two-choice problem things become rapidly more complicated. 
More importantly, they are limited because the stimulus-response model 
that has been prevalent in behavioristic psychology is limited. It is time that 
we went beyond the stimulus-response model. The writer has elsewhere 
proposed [13] a type of psychology that emphasizes information as a major 
concept, in keeping with which a quite new approach to learning theory 
will be needed. 

Information theory came on the psychological scene during the past 
decade. A recent convenient introduction to psychological applications is 
presented by Attneave [1]. General impressions leave some scattered evalua- 
tive conclusions. The procedures of information measurement are in some 
respects merely alternatives to some common statistical procedures. Amount 
of information is analogous to variance. Amount of transmitted information 
is analogous to correlation. Which way of treating data is to be preferred, 
informational or statistical, would depend upon what questions we ask of 
the data and the kind of theory with which we start investigations. 
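
The analogy can be made concrete with a small computation. The sketch below uses the standard Shannon definitions, with amount of information as the entropy of a distribution and transmitted information as T = H(stimulus) + H(response) − H(joint); the function names and the two confusion tables are hypothetical illustrations, not data from any study cited here.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def transmitted_information(table):
    """T = H(rows) + H(columns) - H(cells) for a stimulus-by-response table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    cells = [c for row in table for c in row]
    return entropy(row_totals) + entropy(col_totals) - entropy(cells)

# Hypothetical confusion tables: rows are stimuli, columns are judgments.
perfect = [[25, 0, 0, 0], [0, 25, 0, 0], [0, 0, 25, 0], [0, 0, 0, 25]]
unrelated = [[10, 10], [10, 10]]
print(transmitted_information(perfect))    # four categories used without error
print(transmitted_information(unrelated))  # judgments independent of stimuli
```

Like a correlation, the transmitted information is zero when judgments are independent of stimuli; unlike a correlation, it has no fixed upper bound but is limited by the information in the stimulus set itself.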

Information concepts and methods have been used with refreshing 
enlightenment in some areas of psychological investigation. For example, 
they have provided a badly needed way of quantifying the degree to which 
a perceived object is organized or patterned; how much system there is 
in it. The concept of channel capacity has been employed to determine how 
many categories of absolute judgment one could use without errors of clas- 
sification [28]. Correlation procedures can also be used effectively, however, 
to obtain indices of accuracy of judgment ([10], p. 325). Learning data 
have also been treated by the operations of information measurement. 

Psychology needs a considerable extension and variation of information 
theory to meet its peculiar needs. The type of informational psychology that 
I am proposing elsewhere [13] would seem to call naturally for information- 
measurement treatment. In that connection information is defined as “that 
which the organism discriminates.” From the structure-of-intellect theory, 
four basic kinds of information must be dealt with—figural, symbolic, seman- 
tic, and behavioral. Behavioral information, not yet defined in this paper, 
includes the perceptions, thoughts, desires, intentions, and so on, in ourselves 
and in others. The first major problem will be to find ways of encoding all 
aspects of every kind of information for use in measurement. Limitations 
to success in present efforts at categorizing figural and semantic information 
are already apparent. It may be that those limitations will be overcome. 

Decision theory was developed first by economists, who are naturally 
concerned with the question of how a person reacts to two or more attractive 
alternatives. This application immediately implies the subject of measurement 
of psychological values. Luce [21] has recently developed an axiomatic 
foundation for choice behavior and has applied the results to psychological 
scaling, psychophysical laws, utility, and learning [see also 6]. Cronbach 
and Gleser [4] have also applied principles of decision theory to the practical 
problems of evaluation of tests in connection with personnel selection, place- 
ment, and classification. We shall probably hear a great deal more from the 
decision and utility approaches. They should be particularly adapted to 
important motivational problems. 

Game theory apparently grew out of decision and utility theory, since 
games involve choices, in this case choices of action. There is the added aspect 
of risk involved. Game theory has hardly touched psychology as yet, but 
applications to social behavior would seem to be so obviously needed that 
we cannot expect this situation to endure. There may also be applications 
to personal choices of strategies in problem solving. Luce and Raiffa [23] 
have presented a clear exposition of the game-theory approach. 


Some Knuckle Rapping 


The picture of psychological measurement at large is not entirely a 
rosy one. I cannot refrain from pointing out a few symptoms of psychometric 
pathology. These symptoms pertain to consumers or users of methods rather 
than to producers, I must be quick to say. Some of these pertain to the 
use of ratings and others to the use of factor analysis. The points mentioned 
are only a few that happen to have been forced upon my attention in recent 
years. I am sure that many readers could add to the list. 


Misuse of Statistics 


I have previously [9] spoken out against the permissibility of computing 
coefficients of correlation between variables that are represented by ipsative 
scores as if those scores were normative. Many investigators forget that 
ipsative scores from systematically forced-choice instruments tend to enforce 
equal means and variances within individuals, a condition that introduces 
errors when we want to know how each person stands on each trait in a 
population. For example, scores from instruments such as the Kuder Prefer- 
ence Record, the Edwards Personal Preference Schedule, and the Allport- 
Vernon-Lindzey Study of Values, should not be correlated, without modifi- 
cation, with one another or with other variables. Such correlations are largely 
meaningless. 
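
A small simulation may show why. The sketch below is an illustration under simplifying assumptions: k traits are made statistically independent by construction, and ipsatizing is represented only by expressing each score as a deviation from the person’s own mean (equalizing means but not variances within persons). The names `ipsatize` and `mean_intercorrelation` are hypothetical.

```python
import random

def pearson(xs, ys):
    """Product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def ipsatize(person):
    """Express each score as a deviation from that person's own mean."""
    m = sum(person) / len(person)
    return [s - m for s in person]

random.seed(0)
k, n_people = 4, 2000
# Hypothetical normative scores: k traits that are independent by construction.
normative = [[random.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_people)]
ipsative = [ipsatize(p) for p in normative]

def mean_intercorrelation(scores):
    """Average of the k(k - 1)/2 correlations between trait columns."""
    cols = list(zip(*scores))
    rs = [pearson(cols[i], cols[j]) for i in range(k) for j in range(i + 1, k)]
    return sum(rs) / len(rs)

r_norm = mean_intercorrelation(normative)
r_ips = mean_intercorrelation(ipsative)
print(round(r_norm, 2), round(r_ips, 2))  # near 0 versus near -1/(k - 1)
```

The negative correlations among the ipsative scores are produced entirely by the scoring, since the traits themselves were generated as uncorrelated.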

The need for self-restraint also applies to tests of statistical significance, 
including t tests and F tests applied to ipsative scores. It is very doubtful 
that the sampling conditions that justify such statistical tests are satisfied. 
Statistics from such scores reported in the literature are of doubtful meaning, 
to say the least. 

Another common failing, even with normative scores, is to stretch the 
meaning of significant r’s, t’s, or F’s much too far. An example of this malady 
recently came to my attention in an advertisement for a new test. The 
test was claimed to be valid because the correlations between its scores and 
those of an older test of reputedly the same traits were statistically significant, 
the actual coefficients of correlation not being stated. Let us give the writer 
of this advertisement the benefit of having no conscious intention to mislead 
the reader. But he should have remembered that a significant correlation 
could still be very low if the sample size is large. The confusion of sampling 
statistics with descriptive statistics is by no means restricted to this case. 
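
The arithmetic is easily checked. The sketch below uses the usual t statistic for a correlation, t = r√(n − 2)/√(1 − r²); the values r = .10 and n = 1000 are hypothetical and are not those of the advertisement in question.

```python
import math

def t_for_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

r, n = 0.10, 1000  # hypothetical validity coefficient and sample size
t = t_for_r(r, n)
shared_variance = r * r  # proportion of variance common to the two tests
print(round(t, 1), round(shared_variance, 2))
```

A t a little above 3 exceeds the two-tailed .01 critical value of about 2.58 for so many degrees of freedom, yet the two tests would share only 1 percent of their variance.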

We are all grateful to those statisticians who have developed the many 
tests of significance and the experimental designs to go with them. They 
have undoubtedly toned up the logic of experimental research a great deal. 
Yet it is sometimes rather depressing to see the extent to which these pro- 
cedures are overemphasized at the expense of psychological thinking. To 
some investigators (their numbers are limited, fortunately) it is as if beautiful 
designs and statistical tests become an end rather than a means to an end. 

I recall reading a doctoral dissertation, the source of which will remain 
anonymous, except to say that it was not from my University. The student 
started out fairly well, generating a problem from psychological theory. But 
from there on, his primary concern was with his experimental design, which 
was very complex, and with appropriate use of statistics. He demonstrated 
nothing of any psychological importance. He had done an exercise in statistical 
manipulations, but his dissertation was not even much of a contribution to 
methodology. A quotation from H. H. Kendler ([19], p. 79) sums up the 
matter: “Complicated experimental designs involving complex statistical 
procedures seem to be offered in lieu of theoretical notions.” 

Related to this situation is an overuse of dichotomies in experimental 
research. The availability of analysis-of-variance techniques is largely to 
blame for this. Usually the investigator is studying the functional dependence 
of one continuous variable on another (sometimes you would suspect that 
he does not realize this fact). He arbitrarily chooses two levels, perhaps 
three, on one or more variables and determines whether there are significant 
differences on the dependent variable in relation to one or more independent 
variables. There is apparently little or no interest in the form of the regressions, 
whether they are linear or not, or even whether monotonic or not. Many 
an insignificant difference may be due to a seriously nonlinear or nonmono- 
tonic regression. An obvious remedy for this situation is a healthier respect 
for correlational approaches, for which there are also significance tests. 
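
A contrived example shows how two arbitrarily chosen levels can conceal a strong nonmonotonic dependence. The inverted-U function, the five levels, and the fixed offsets standing in for individual differences are all hypothetical; the correlation ratio (eta) is used here as one correlational index that does not presuppose linearity.

```python
def mean(values):
    return sum(values) / len(values)

def correlation_ratio_squared(groups):
    """Eta squared: between-level sum of squares over total sum of squares."""
    all_y = [y for g in groups for y in g]
    grand = mean(all_y)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    total = sum((y - grand) ** 2 for y in all_y)
    return between / total

def performance(x, offset):
    # Hypothetical inverted-U dependence of y on x, peaking at x = 3.
    return 10.0 - (x - 3.0) ** 2 + offset

offsets = [-1.0, 0.0, 1.0]  # stand-in for individual differences at each level
levels = {x: [performance(x, e) for e in offsets] for x in (1, 2, 3, 4, 5)}

# Two arbitrarily chosen levels, symmetric about the peak: no mean difference.
diff_two_levels = mean(levels[5]) - mean(levels[1])
# All five levels: most of the variance in y is associated with x.
eta_sq = correlation_ratio_squared(list(levels.values()))
print(diff_two_levels, round(eta_sq, 2))
```

The comparison of the two extreme levels would be reported as an insignificant difference, although in these artificial data about four fifths of the variance of the dependent variable is associated with the independent variable.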


Use of Rating Methods 


Experience with ratings has led me to a general distrust of them, except 
in the case of judgments of single psychological dimensions that are associ- 
ated with experimentally controlled stimulus dimensions, and except for 
certain applications of check-list ratings of observed behavior. As objects 
of observation, human beings are the most complex stimuli that we can 
rate. Yet they are probably the most commonly rated objects. The old 
saying, “Where ignorance is bliss, etc.” is very apropos. If an investigator 
does not wish to know how badly his ratings have gone astray, he should 
not subject them to correlational analysis. Validity of ratings should not 
be taken for granted, as it so often is, if the research is to be at all serious. 
Analysis will show that in addition to the various kinds of errors with which 
we are all acquainted and which are known to affect reliability, there are a 
number of other errors that also affect validity. Raters commonly confuse 
dimensions. Instructed to rate along one dimension they rate off on some 
other dimension. In the evaluation of complex stimuli, when there is no other 
method available we may rate. But let us find out all we can about the nature 
of the ratings. 


Factor Analysis 


Factor analysis continues to be abused. A few years ago I wrote an 
article [9] entitled “When not to factor analyze.” There is little evidence 
that it has had much effect. 

The place of factor analysis as a research instrument is not adequately 
realized. Two different uses are common, but the distinction between them 
is not sufficiently appreciated. One use is that of reducing data to simpler 
terms for a better cognitive grasp of those particular data—probably nothing 
more. In a universe of data it is usually possible to envisage all variations 
in terms of a smaller number of dimensions than there are experimental 
variables. An example of this use of factor analysis would be in connection 
with ratings of a sample of personnel in terms of k defined variables. Inter- 
correlations of such ratings are almost always substantial to high but of 
uneven size, indicating some degree of structuring with probably more than 
one dimension involved. The result is a kind of classification of the rating 
variables. With insight into the experimental situation, the investigator may 
be able to rationalize the factors. The factors probably do not represent 
anything of a fundamental psychological nature. They may tell us as much 
about the semantic habits of the raters as about the properties of personnel. 
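
The idea that a few dimensions can account for many intercorrelations can be sketched numerically. The block below is only an illustration under stated assumptions: the six rating variables, the two orthogonal common factors, and every loading value are hypothetical and are not taken from any study cited here. It computes the correlations implied by those loadings, which come out substantial within each cluster of variables but uneven in size overall.

```python
# Hypothetical loadings of six rating variables on two orthogonal common
# factors. These numbers are invented for illustration only.
LOADINGS = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.6, 0.3],
    [0.2, 0.7],
    [0.1, 0.8],
    [0.3, 0.6],
]


def implied_correlations(loadings):
    """Return the correlation matrix implied by orthogonal common factors.

    Each off-diagonal entry is the sum of products of the two variables'
    loadings; each diagonal entry is 1.0 (communality plus uniqueness).
    """
    k = len(loadings)
    return [
        [
            1.0 if i == j
            else sum(a * b for a, b in zip(loadings[i], loadings[j]))
            for j in range(k)
        ]
        for i in range(k)
    ]


R = implied_correlations(LOADINGS)
for row in R:
    print(" ".join(f"{r:5.2f}" for r in row))
```

Fifteen distinct intercorrelations are here generated from two columns of loadings, which is the sense in which the variation is envisaged in fewer dimensions than there are variables. Nothing in the arithmetic says what the two dimensions mean; that interpretation rests with the investigator.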

The second major use of factor analysis is to discover fundamental 
psychological variables, such as unique traits, traits each of which cannot be 
fully accounted for by any other trait or traits discovered by the same process. 
As Thurstone so often emphasized, such a study should begin with psychological 
hypotheses, and the analyzed variables are tailor-made for the purpose 
of the analysis. In this use, one does not throw together, helter-skelter, just 
any combination of tests or other instruments that are convenient or available. 
The factors from such a haphazard analysis are less likely to coincide with 
psychologically meaningful dimensions. The choice of material is one of the most 
important experimental controls available to the factor analyst. 

The rotational problem has bothered many psychologists and others. 
Those with a strong mathematical conscience insist upon a completely 
objective analytical rotation or none. If such a rotation were successful 
from the psychological point of view it would be fine. Unfortunately it rarely 
is. The reason is that we can never know enough in advance of the analysis 
to include just the right variables to represent each common factor. We 
should have to know as much as we do after the analysis, perhaps more. The 
sampling of experimental variables is all-important for achieving rotation 
mechanically to satisfy some rigorous mathematical criterion. If we knew 
enough for this purpose in advance we should hardly have to do the analysis 
at all. 
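
The reason a rotational problem exists at all can be shown in a few lines. The sketch below assumes two orthogonal common factors and a set of hypothetical loadings (invented for illustration); the function names `rotate` and `common_part` are likewise assumptions of this example. It rotates the factor axes through an arbitrary angle and confirms that the reproduced correlations are unchanged, so the data alone cannot choose among rotations.

```python
import math

# Hypothetical two-factor loadings for six variables (illustration only).
LOADINGS = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.6, 0.3],
    [0.2, 0.7],
    [0.1, 0.8],
    [0.3, 0.6],
]


def rotate(loadings, degrees):
    """Apply an orthogonal rotation of the two factor axes to every row."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [[a * c + b * s, -a * s + b * c] for a, b in loadings]


def common_part(loadings):
    """Return the loadings matrix times its transpose (the reproduced
    correlations, with communalities on the diagonal)."""
    return [
        [sum(x * y for x, y in zip(ri, rj)) for rj in loadings]
        for ri in loadings
    ]


rotated = rotate(LOADINGS, 30.0)
before = common_part(LOADINGS)
after = common_part(rotated)

# Largest absolute discrepancy between the two reproduced matrices.
max_diff = max(
    abs(x - y)
    for row_b, row_a in zip(before, after)
    for x, y in zip(row_b, row_a)
)
print("largest difference after rotation:", max_diff)
```

The loadings themselves change with the angle while the fit to the correlations does not. Any criterion for preferring one position of the axes, whether analytical or judgmental, must therefore come from outside the fit.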

And while I am on the subject of computations, I should like to express 
a warning against a new disease: “computeritis.” High-speed computers are 
wonderful; but they cannot do all the thinking for the psychologist and 
they may conceal information as well as reveal it. 


Some Future Needs in Psychological Measurement 


Before closing, I should like to adopt a futuristic attitude toward the 
field of psychological measurement, not to make predictions but to suggest 
a few directions of effort that seem to be needed. In general, the survey of 
recent and current developments in the field indicates that the creature is 
lively and in general good health. There appear to be a few deficiencies, 
however, detracting from a picture of perfect health. These deficiencies 
pertain to neglected areas of scaling, particularly in the measurement of 
performances, to areas not yet sufficiently explored by quantitative, rational 
approaches, and to areas which lack experimental checking where there has 
been such rationalization. 

In the research areas of learning and problem solving, particularly, more 
attention needs to be given to rational scales for the assessment of behavior. 
Performances have been measured too often by the more available aspects 
that can be readily quantified, without enough consideration as to whether 
those measures are the ones needed to test hypotheses and theories. Basic 
theory is stated with regard to latent variables such as “habit strength,” 
but our experimental assessments are of manifest variables such as pro- 
portion of correct responses, response time, and resistance to extinction. 
Humphreys [18] has made a beginning in the study of such manifest scores 
by factor analysis. Others [40, 17] have tried out various transformations 
of such information in an effort to achieve more rational scales. We need 
more of these kinds of effort. 

Quantitative rationale has been developed in some areas of psychological 
research and not in others. Some of the favored areas have been psychological 
scaling, testing, learning, neurology, and social behavior. Relatively neglected 
areas are motivation, emotion, affective values (or utility, if you prefer), 
attitudes and attitude changes, and thinking in its various forms. 

Although new rationalization is needed in all these areas, another need 
is more experimental follow-up to test such hypothetico-deductive beginnings 
as have been made. As anyone knows, this phase of research is much more 
time consuming than the development of the theory. Often the originator 
of the mathematical theory derives much more satisfaction from going on 
to new mathematical problems or has less skill in theory-testing research. 
Other investigators, who perhaps have more taste for and more skill in the 
latter, do not carry on from the theory. Although the latter investigators 
might profitably find their problems in connection with quantitative theories 
developed by others, communication between the two groups of investigators 
is not always adequate. Many quantitative theories in the literature remain 
without any experimental testing. 

One reason for this is that the average investigator lacks sufficient 
training in mathematics. The day has surely come when all students who 
are being trained for research in psychology, basic or technological, must 
gain an adequate background in mathematics. This should include not only 
the calculus but also matrix algebra and probability with its related fields 
included under finite mathematics. Let us hope that the next 25 years will 
be an exponential extension of the period just passed and that the record 
at the time of our golden anniversary will show that this goal has been satisfied, 
if not exceeded. 